Fix accuracy of max-pooling backpropagation for bfloat16 data #2386

Merged: 6 commits merged into main from asimonov/maxpool_bwd_bf16_accuracy on Mar 12, 2025

Conversation

@asimonov1 asimonov1 commented Jan 13, 2025

MFDNN-11050 bf16 backward max pooling returns incorrect results
MFDNN-11396 BF16 pooling_backward performance regression on SPR

As a result of the refactoring, MFDNN-12863 (JIT max pool implementation works incorrectly for small data types and large kernels) was also fixed.
Also, as recommended during review, the io injector is now used to load/store tensor data.

It was found earlier (MFDNN-11050) that bf16 backward max pooling returns incorrect results. An initial accuracy fix led to a significant performance regression (MFDNN-11396) and was rolled back.

The root cause of the accuracy issue is that bf16 summation is inaccurate even for relatively small numbers, e.g. bf16(256.0) + bf16(1.0) is bf16(256.0). Such summation occurs when some pooling strides are smaller than the corresponding kernel sizes, so overlapping windows accumulate several gradients into the same src_diff element.
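
For illustration only, a minimal standalone C++ sketch of the rounding behavior (bf16 is emulated here by truncating an f32 to its upper 16 bits; oneDNN's real conversion rounds to nearest-even, but the result for this example is the same):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Emulate bf16 by truncating an f32 to its upper 16 bits
// (keeps the sign, the 8 exponent bits, and 7 mantissa bits).
static float to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xffff0000u;
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float acc = to_bf16(256.0f);
    acc = to_bf16(acc + to_bf16(1.0f)); // 257 is not representable in bf16
    std::printf("%g\n", acc);           // prints 256 -- the 1.0 is lost
    return 0;
}
```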

The current fix uses additional per-thread f32 accumulation arrays. Their size for src_diff is the same as in the existing ncsp implementation (the ncsp implementation creates f32 arrays for dst_diff, src_diff and indices, reorders the data, and uses those arrays during the computation). The ncsp case is not affected by this PR, except for the changed load/store functions.
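
A schematic (non-JIT) sketch of the per-thread f32 accumulation idea; the bf16_t stand-in type and the max_pool_bwd_slice function are hypothetical and are not the actual jit_uni_pooling code:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal bf16 stand-in (truncating conversion), for illustration only.
struct bf16_t {
    uint16_t raw = 0;
    bf16_t() = default;
    explicit bf16_t(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));
        raw = uint16_t(bits >> 16);
    }
    explicit operator float() const {
        uint32_t bits = uint32_t(raw) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(bits));
        return f;
    }
};

// Each thread owns a private f32 scratch buffer for its slice of src_diff,
// so overlapping pooling windows accumulate in f32 and the result is
// converted to bf16 only once, at the end.
void max_pool_bwd_slice(const bf16_t *dst_diff, const int *indices,
        bf16_t *src_diff, int dst_len, int src_len,
        std::vector<float> &acc /* per-thread scratch, size >= src_len */) {
    std::fill(acc.begin(), acc.begin() + src_len, 0.f);
    for (int od = 0; od < dst_len; ++od)
        acc[indices[od]] += float(dst_diff[od]); // f32 accumulation
    for (int is = 0; is < src_len; ++is)
        src_diff[is] = bf16_t(acc[is]); // single down-conversion at the end
}
```

The actual implementation does this inside the JIT kernel with vectorized loads/stores; the sketch only shows why a private f32 buffer restores accuracy.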

I have done some manual measurements on a machine with an SPR processor. In some cases this implementation is faster than the original version, in others slower, but it is significantly faster than the non-optimized implementation (the one used after the first fix for MFDNN-11050).

The following tables contain performance data for the axb and aBx16b layouts for the original implementation (main branch), the fixed version (this PR), and the fallback implementation that is used when the optimized implementation is skipped.

Sketch of the script used to run the tests:

export KMP_AFFINITY=granularity=fine,compact,1,0
export OMP_NUM_THREADS=56
...
export LD_PRELOAD=${libiomp5_loc}/libiomp5.so
numactl --physcpubind=0-59 --membind=0 ./benchdnn -v5 --mode=p --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=BWD_D --alg=pooling_max --dt=bf16:bf16 --tag=<tag> <problem>

axb

| problem | original (ms) | fixed (ms) | other (simple_nhwc) (ms) |
| --- | --- | --- | --- |
| mb200_ic32_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 13 | 13 | 451 |
| mb200_ic35_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 29 | 22 | 1685 |
| mb200_ic32_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 16 | 19 | 857 |
| mb200_ic35_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 55 | 33 | 3960 |
| mb128ic128_ih112oh56kh3sh2_iw112ow56kw3sw2 | 6.7 | 14 | 117 |
| mb512_ic512_iw2048kw129sw1ow1920 | 142 | 88 | 2480 |
| mb128ic64_ih112oh56kh3sh2dh0ph0_iw112ow56kw3sw2dw0pw0 (from MFDNN-11396) | 1.85 | 5.34 | 67 |

aBx16b

| problem | original (ms) | fixed (ms) | other (ref) (ms) |
| --- | --- | --- | --- |
| mb200_ic32_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 13 | 7 | 310 |
| mb200_ic35_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 20 | 10 | 338 |
| mb200_ic32_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 16 | 14 | 155 |
| mb200_ic35_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 25 | 19 | 177 |
| mb128ic128_ih112oh56kh3sh2_iw112ow56kw3sw2 | 3.6 | 3.7 | 147 |
| mb512_ic512_iw2048kw129sw1ow1920 | 121 | 60 | 1310 |
| mb128ic64_ih112oh56kh3sh2dh0ph0_iw112ow56kw3sw2dw0pw0 (from MFDNN-11396) | 1.33 | 1.55 | 73 |

@github-actions bot added the platform:cpu-x64 (Intel64/AMD64 processors; codeowner: @oneapi-src/onednn-cpu-x64) and component:tests (codeowner: @oneapi-src/onednn-arch) labels on Jan 13, 2025
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 17ab064 to 2820609 on January 13, 2025 18:39
@asimonov1 asimonov1 changed the title Use float32 accumulator for max-pooling backpropagation for bfloat16 data Fix accuracy of max-pooling backpropagation for bfloat16 data Jan 13, 2025
@asimonov1 (Contributor Author):

make test
disable device_gpu
disable benchdnn_all
enable benchdnn_pool
enable benchdnn_nightly

@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 2820609 to 73b85e2 on January 14, 2025 16:09
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 73b85e2 to dce529d on January 15, 2025 16:49
@@ -18,6 +18,7 @@
#include <bitset>

#include "common/dnnl_thread.hpp"
#include "common/memory_desc.hpp"
Contributor:

This one should come with "cpu/cpu_pooling_pd.hpp", thus, not needed as standalone.

Contributor Author:

Removed.

}

template <cpu_isa_t isa>
inline void jit_uni_pool_kernel<isa>::load32(const int idx,
Contributor:

There are already a lot of existing routines to support loading. I highly recommend changing the loading/storing implementation to rely on the io_injector, which is more flexible.

Contributor Author:

Yes, the loading/storing functions need refactoring. I did not know how to do that properly, and I was not aware of the io_injector; I will have to investigate it. Could it be done as a separate task?

Contributor:

If doing it as a separate task, I definitely recommend refactoring the existing implementation first, and then applying f32 accumulation together with acc_mode support.

Contributor Author:

The io injector is now used to load/store tensor data, but indices are processed the same way as before, because the io injector does not support these data types: it converts integers to floats during loading (however, it does not convert the data back when it is stored as s32; s32 and f32 are stored by one common function, store_f32, which looks like a bug).

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 2 times, most recently from 8f5d49e to 8d3b54c on January 21, 2025 16:22
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 3 times, most recently from 25cbb9a to f5d5a1d on January 26, 2025 14:25
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 4 times, most recently from d98dd7d to 5734498 on February 5, 2025 16:39
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 5734498 to 3ee89a1 on February 8, 2025 00:39
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 5 times, most recently from fbf38e7 to 3abcafb on February 12, 2025 12:18
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 marked this pull request as ready for review February 13, 2025 22:40
@asimonov1 asimonov1 requested review from a team as code owners February 13, 2025 22:40
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 2 times, most recently from 26092b9 to 828829f on February 18, 2025 16:37
@asimonov1 asimonov1 requested a review from dzarukin February 19, 2025 12:44
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 2 times, most recently from 8e4012d to 393950f on February 20, 2025 17:52
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 393950f to 5b90eee on March 4, 2025 14:36
@asimonov1 (Contributor Author):

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

// While this is only required by the backward pass, the quirk above
// is applied to the forward pass as well to keep things simpler.
Opmask k_c_tail_mask = Opmask(
4); // is shared with jit_io_multi_dt_helper_t and jit_uni_postops_injector_t
Contributor:

Style: it would be nice to have the comment above the declaration for better readability.

Contributor Author:

Updated

--attr-post-ops=

--alg=max
--tag=axb,aBx8b,aBx16b
Contributor:

Let's drop the aBx8b layout from here; it shouldn't be used for backward at all. I'd drop 16b as well, but it's fine to keep what we had before the change.

Contributor Author:

aBx8b is dropped

--dir=BWD_D
--attr-acc-mode=relaxed
--batch=set_all
--batch=set_topologies
Contributor:

Let's keep the smaller batch of these two (I don't know which one it is).

Contributor Author:

set_all is dropped

Do not use f32 accumulator in jit_uni_pooling for max pooling backpropagation with bf16 if 'relaxed' or 'any' accumulation mode is specified.
Use zero error threshold in tests for max pooling if 'strict' or 'f32' accumulation mode is specified.
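
For context, a minimal sketch of how a user would request strict accumulation through the C++ API; it assumes the primitive_attr::set_accumulation_mode attribute available in recent oneDNN releases and is not taken from this PR:

```cpp
#include "oneapi/dnnl/dnnl.hpp"

// Sketch only: with 'strict', the bf16 max-pooling backward pass must use the
// f32 accumulator; with 'relaxed' or 'any', the implementation may skip it.
dnnl::primitive_attr make_strict_attr() {
    dnnl::primitive_attr attr;
    attr.set_accumulation_mode(dnnl::accumulation_mode::strict);
    return attr;
}
```

In benchdnn, the corresponding knob is --attr-acc-mode (the test batch in this PR uses --attr-acc-mode=relaxed).
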
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 5b90eee to 5cc87a6 on March 6, 2025 14:50
@asimonov1 asimonov1 merged commit bc30f90 into main Mar 12, 2025
22 checks passed
@asimonov1 asimonov1 deleted the asimonov/maxpool_bwd_bf16_accuracy branch March 12, 2025 11:07