Fix accuracy of max-pooling backpropagation for bfloat16 data #2386

Merged: 6 commits merged into main from asimonov/maxpool_bwd_bf16_accuracy on Mar 12, 2025

Conversation

@asimonov1 asimonov1 commented Jan 13, 2025

MFDNN-11050 bf16 backward max pooling returns incorrect results
MFDNN-11396 BF16 pooling_backward performance regression on SPR

As a result of the refactoring, MFDNN-12863 (JIT max pool implementation works incorrectly for small data types and large kernels) was also fixed.
Also, as recommended during review, the io injector is now used to load/store tensor data.

It was found earlier (MFDNN-11050) that bf16 backward max pooling returns incorrect results. An initial accuracy fix led to a significant performance regression (MFDNN-11396) and was rolled back.

The root cause of the accuracy issue is that bf16 summation is inaccurate even for relatively small numbers, e.g. bf16(256.0) + bf16(1.0) is bf16(256.0). Such summation occurs when some pooling strides are smaller than the corresponding kernel sizes, so overlapping windows accumulate several gradients into the same src_diff element.
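
For illustration only, a minimal standalone C++ sketch of the rounding behavior (bf16 is emulated here by truncating an f32 to its upper 16 bits; oneDNN's real conversion rounds to nearest-even, but the result for this example is the same):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Emulate bf16 by truncating an f32 to its upper 16 bits
// (keeps the sign, the 8 exponent bits, and 7 mantissa bits).
static float to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xffff0000u;
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float acc = to_bf16(256.0f);
    acc = to_bf16(acc + to_bf16(1.0f)); // 257 is not representable in bf16
    std::printf("%g\n", acc);           // prints 256 -- the 1.0 is lost
    return 0;
}
```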

The current fix uses additional per-thread f32 accumulation arrays. Their size for src_diff is the same as in the existing ncsp implementation (the ncsp implementation creates f32 arrays for dst_diff, src_diff and indices, reorders the data, and uses those arrays during the computation). The ncsp case is not affected by this PR, except for the changed load/store functions.
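
A schematic (non-JIT) sketch of the per-thread f32 accumulation idea; the bf16_t stand-in type and the max_pool_bwd_slice function are hypothetical and are not the actual jit_uni_pooling code:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal bf16 stand-in (truncating conversion), for illustration only.
struct bf16_t {
    uint16_t raw = 0;
    bf16_t() = default;
    explicit bf16_t(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));
        raw = uint16_t(bits >> 16);
    }
    explicit operator float() const {
        uint32_t bits = uint32_t(raw) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(bits));
        return f;
    }
};

// Each thread owns a private f32 scratch buffer for its slice of src_diff,
// so overlapping pooling windows accumulate in f32 and the result is
// converted to bf16 only once, at the end.
void max_pool_bwd_slice(const bf16_t *dst_diff, const int *indices,
        bf16_t *src_diff, int dst_len, int src_len,
        std::vector<float> &acc /* per-thread scratch, size >= src_len */) {
    std::fill(acc.begin(), acc.begin() + src_len, 0.f);
    for (int od = 0; od < dst_len; ++od)
        acc[indices[od]] += float(dst_diff[od]); // f32 accumulation
    for (int is = 0; is < src_len; ++is)
        src_diff[is] = bf16_t(acc[is]); // single down-conversion at the end
}
```

The actual implementation does this inside the JIT kernel with vectorized loads/stores; the sketch only shows why a private f32 buffer restores accuracy.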

I have done some manual measurements on a machine with an SPR processor. In some cases this implementation is faster than the original version, in others slower, but it is significantly faster than the non-optimized implementation (the one used after the first fix for MFDNN-11050).

The following tables contain performance data for the axb and aBx16b layouts for the original implementation (main branch), the fixed version (this PR), and the fallback implementation that is used when the optimized implementation is skipped.

Sketch of the script used to run the tests:

export KMP_AFFINITY=granularity=fine,compact,1,0
export OMP_NUM_THREADS=56
...
export LD_PRELOAD=${libiomp5_loc}/libiomp5.so
numactl --physcpubind=0-59 --membind=0 ./benchdnn -v5 --mode=p --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=BWD_D --alg=pooling_max --dt=bf16:bf16 --tag=<tag> <problem>

axb

| problem | original (ms) | fixed (ms) | other (simple_nhwc) (ms) |
| --- | --- | --- | --- |
| mb200_ic32_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 13 | 13 | 451 |
| mb200_ic35_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 29 | 22 | 1685 |
| mb200_ic32_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 16 | 19 | 857 |
| mb200_ic35_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 55 | 33 | 3960 |
| mb128ic128_ih112oh56kh3sh2_iw112ow56kw3sw2 | 6.7 | 14 | 117 |
| mb512_ic512_iw2048kw129sw1ow1920 | 142 | 88 | 2480 |
| mb128ic64_ih112oh56kh3sh2dh0ph0_iw112ow56kw3sw2dw0pw0 (from MFDNN-11396) | 1.85 | 5.34 | 67 |

aBx16b

| problem | original (ms) | fixed (ms) | other (ref) (ms) |
| --- | --- | --- | --- |
| mb200_ic32_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 13 | 7 | 310 |
| mb200_ic35_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 20 | 10 | 338 |
| mb200_ic32_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 16 | 14 | 155 |
| mb200_ic35_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 25 | 19 | 177 |
| mb128ic128_ih112oh56kh3sh2_iw112ow56kw3sw2 | 3.6 | 3.7 | 147 |
| mb512_ic512_iw2048kw129sw1ow1920 | 121 | 60 | 1310 |
| mb128ic64_ih112oh56kh3sh2dh0ph0_iw112ow56kw3sw2dw0pw0 (from MFDNN-11396) | 1.33 | 1.55 | 73 |

@github-actions bot added the platform:cpu-x64 (Intel64/AMD64 processors; codeowner: @oneapi-src/onednn-cpu-x64) and component:tests (codeowner: @oneapi-src/onednn-arch) labels on Jan 13, 2025
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 17ab064 to 2820609 on January 13, 2025 18:39
@asimonov1 asimonov1 changed the title Use float32 accumulator for max-pooling backpropagation for bfloat16 data Fix accuracy of max-pooling backpropagation for bfloat16 data Jan 13, 2025
@asimonov1 (Contributor Author):

make test
disable device_gpu
disable benchdnn_all
enable benchdnn_pool
enable benchdnn_nightly

@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 2820609 to 73b85e2 on January 14, 2025 16:09
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 73b85e2 to dce529d on January 15, 2025 16:49
@@ -18,6 +18,7 @@
#include <bitset>

#include "common/dnnl_thread.hpp"
#include "common/memory_desc.hpp"
Contributor:

This one should come with "cpu/cpu_pooling_pd.hpp", thus, not needed as standalone.

Contributor Author:

Removed.

}

template <cpu_isa_t isa>
inline void jit_uni_pool_kernel<isa>::load32(const int idx,
Contributor:

There are already a lot of existing routines to support loading. I highly recommend changing the loading/storing implementation to rely on the io_injector, which is more flexible.

Contributor Author:

Yes, the loading/storing functions need refactoring. I did not know how to do that properly, and I was not aware of the io_injector; I will have to investigate it. Could it be done as a separate task?

Contributor:

If doing it as a separate task, I definitely recommend refactoring the existing implementation first, and then applying f32 accumulation together with acc_mode support.

Contributor Author:

The io injector is now used to load/store tensor data, but indices are processed the same way as before, because the io injector does not support these data types: it converts integers to floats during loading (however, it does not convert the data back when it is stored as s32; s32 and f32 are stored by one common function, store_f32, which looks like a bug).

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 2 times, most recently from 8f5d49e to 8d3b54c on January 21, 2025 16:22
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 3 times, most recently from 25cbb9a to f5d5a1d on January 26, 2025 14:25
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 4 times, most recently from d98dd7d to 5734498 on February 5, 2025 16:39
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 5734498 to 3ee89a1 on February 8, 2025 00:39
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 5 times, most recently from fbf38e7 to 3abcafb on February 12, 2025 12:18
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 marked this pull request as ready for review February 13, 2025 22:40
@asimonov1 asimonov1 requested review from a team as code owners February 13, 2025 22:40
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 2 times, most recently from 26092b9 to 828829f on February 18, 2025 16:37
@asimonov1 asimonov1 requested a review from dzarukin February 19, 2025 12:44
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 2 times, most recently from 8e4012d to 393950f on February 20, 2025 17:52
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 393950f to 5b90eee on March 4, 2025 14:36
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

// While this is only required by the backward pass, the quirk above
// is applied to the forward pass as well to keep things simpler.
Opmask k_c_tail_mask = Opmask(
4); // is shared with jit_io_multi_dt_helper_t and jit_uni_postops_injector_t
Contributor:

Style: it would be nice to have the comment above the declaration for better readability.

Contributor Author:

Updated

--attr-post-ops=

--alg=max
--tag=axb,aBx8b,aBx16b
Contributor:

Let's drop the aBx8b layout from here; it shouldn't be used for backward at all. I'd drop 16b as well, but it's fine to keep what we had before the change.

Contributor Author:

aBx8b is dropped

--dir=BWD_D
--attr-acc-mode=relaxed
--batch=set_all
--batch=set_topologies
Contributor:

Let's keep the smaller batch of these two (I don't know which one it is).

Contributor Author:

set_all is dropped

Do not use f32 accumulator in jit_uni_pooling for max pooling backpropagation with bf16 if 'relaxed' or 'any' accumulation mode is specified.
Use zero error threshold in tests for max pooling if 'strict' or 'f32' accumulation mode is specified.
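
For context, a minimal sketch of how a user would request strict accumulation through the C++ API; it assumes the primitive_attr::set_accumulation_mode attribute available in recent oneDNN releases and is not taken from this PR:

```cpp
#include "oneapi/dnnl/dnnl.hpp"

// Sketch only: with 'strict', the bf16 max-pooling backward pass must use the
// f32 accumulator; with 'relaxed' or 'any', the implementation may skip it.
dnnl::primitive_attr make_strict_attr() {
    dnnl::primitive_attr attr;
    attr.set_accumulation_mode(dnnl::accumulation_mode::strict);
    return attr;
}
```

In benchdnn, the corresponding knob is --attr-acc-mode (the test batch in this PR uses --attr-acc-mode=relaxed).
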
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 5b90eee to 5cc87a6 on March 6, 2025 14:50
@asimonov1 asimonov1 merged commit bc30f90 into main Mar 12, 2025
22 checks passed
@asimonov1 asimonov1 deleted the asimonov/maxpool_bwd_bf16_accuracy branch March 12, 2025 11:07