cpu: aarch64: Enable stateless ACL LayerNorm #2804

Open: wants to merge 1 commit into base: main


@manaalmj manaalmj commented Mar 4, 2025

Description

Make the layernorm op use the stateless ACL interface.

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

@manaalmj manaalmj requested a review from a team as a code owner March 4, 2025 02:15
@github-actions github-actions bot added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Mar 4, 2025
@Sqvid
Contributor

Sqvid commented Mar 5, 2025

Could you post the benchdnn line you used to test your lnorm changes, along with performance numbers and oneDNN verbose output from before and after the change? Thanks.

@Sqvid
Contributor

Sqvid commented Mar 5, 2025

Please squash your commits as well. The first one doesn't build without the fixes in the second.

@manaalmj
Author

manaalmj commented Mar 5, 2025

> Could you post the benchdnn line you used to test your lnorm changes, along with performance numbers and oneDNN verbose output from before and after the change? Thanks.

WITHOUT CHANGE
OMP_NUM_THREADS=1 ONEDNN_VERBOSE=all ./benchdnn --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_n"lnorm_ci_0d:3" --mode=P
onednn_verbose,v1,info,oneDNN v3.8.0 (commit 1e3bc8d)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,v1,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,create:cache_miss,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,257x768,0.165039
onednn_verbose,v1,primitive,create:cache_hit,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,257x768,0.0671387
onednn_verbose,v1,cpu,acl,unsupported: Only Ab4a/Ab8a, BA8b4a/BA4b4a and Acdb8a/Acdb4a destination formats supported
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,257x768,0.315918
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,257x768,0.0368652
onednn_verbose,v1,primitive,exec:external,NEMeanStdDevNormalizationKernel,2.09009
onednn_verbose,v1,primitive,exec,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,257x768,2.28711
onednn_verbose,v1,primitive,create:cache_hit,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,257x768,0.013916
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,257x768,0.0388184
0:PASSED __REPRO: --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_nlnorm_ci_0d:3
lnorm driver: WARNING: No problem found for a given option!
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.07s; fill: 0.02s (23%);

OMP_NUM_THREADS=1 ONEDNN_VERBOSE=all ./benchdnn --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 256x768_n"lnorm_ci_0d:2" --mode=P
onednn_verbose,v1,info,oneDNN v3.8.0 (commit 1e3bc8d)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,v1,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,create:cache_miss,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,256x768,0.186035
onednn_verbose,v1,primitive,create:cache_hit,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,256x768,0.0720215
onednn_verbose,v1,cpu,acl,unsupported: Only Ab4a/Ab8a, BA8b4a/BA4b4a and Acdb8a/Acdb4a destination formats supported
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,256x768,0.316895
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,256x768,0.0368652
onednn_verbose,v1,primitive,exec:external,NEMeanStdDevNormalizationKernel,2.05298
onednn_verbose,v1,primitive,exec,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,256x768,2.26978
onednn_verbose,v1,primitive,create:cache_hit,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,256x768,0.0158691
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,256x768,0.0419922
0:PASSED __REPRO: --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 256x768_nlnorm_ci_0d:2
lnorm driver: WARNING: No problem found for a given option!
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.07s; fill: 0.01s (22%);

WITH CHANGE
OMP_NUM_THREADS=1 ONEDNN_VERBOSE=all ./benchdnn --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_n"lnorm_ci_0d:3" --mode=P
onednn_verbose,v1,info,oneDNN v3.8.0 (commit 4cc7bbf)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,v1,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,create:cache_miss,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,257x768,0.2229
onednn_verbose,v1,primitive,create:cache_hit,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,257x768,0.0078125
onednn_verbose,v1,cpu,acl,unsupported: Only Ab4a/Ab8a, BA8b4a/BA4b4a and Acdb8a/Acdb4a destination formats supported
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,257x768,0.271973
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,257x768,0.0979004
onednn_verbose,v1,primitive,exec:external,CpuMeanStdDevNormalizationKernel,2.05005
onednn_verbose,v1,primitive,exec,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,257x768,2.40796
onednn_verbose,v1,primitive,create:cache_hit,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,257x768,0.013916
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,257x768,0.0419922
0:PASSED __REPRO: --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_nlnorm_ci_0d:3
lnorm driver: WARNING: No problem found for a given option!
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.07s; fill: 0.01s (22%);

OMP_NUM_THREADS=1 ONEDNN_VERBOSE=all ./benchdnn --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 256x768_n"lnorm_ci_0d:2" --mode=P
onednn_verbose,v1,info,oneDNN v3.8.0 (commit 4cc7bbf)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,v1,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,primitive,create:cache_miss,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,256x768,0.235107
onednn_verbose,v1,primitive,create:cache_hit,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,256x768,0.0090332
onednn_verbose,v1,cpu,acl,unsupported: Only Ab4a/Ab8a, BA8b4a/BA4b4a and Acdb8a/Acdb4a destination formats supported
onednn_verbose,v1,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,256x768,0.268799
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,256x768,0.0400391
onednn_verbose,v1,primitive,exec:external,CpuMeanStdDevNormalizationKernel,2.02295
onednn_verbose,v1,primitive,exec,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,256x768,2.37378
onednn_verbose,v1,primitive,create:cache_hit,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,256x768,0.0148926
onednn_verbose,v1,primitive,exec,cpu,reorder,jit:uni,undef,src:f32::blocked:ab::f0 dst:f32::blocked:ab::f0,,,256x768,0.0410156
0:PASSED __REPRO: --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 256x768_nlnorm_ci_0d:2
lnorm driver: WARNING: No problem found for a given option!
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.07s; fill: 0.02s (23%);
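
For anyone comparing the verbose logs above by hand, here is a minimal Python sketch that pulls the `exec` time of a given primitive out of `ONEDNN_VERBOSE` output. The field names come from the `template:` header line in the logs; the helper name `parse_exec_lines` is made up for illustration, and the sketch assumes no field contains a comma. The sample lines are copied from the 257x768 runs above. These are single-run timings (in milliseconds), so the difference is only indicative.

```python
from typing import Dict, List

# Field names taken from the verbose header line, i.e. everything after
# the "onednn_verbose,v1,primitive" prefix.
FIELDS = [
    "operation", "engine", "primitive", "implementation", "prop_kind",
    "memory_descriptors", "attributes", "auxiliary", "problem_desc",
    "exec_time",
]
PREFIX = ["onednn_verbose", "v1", "primitive"]


def parse_exec_lines(log: str, primitive: str) -> List[Dict[str, str]]:
    """Return the exec records of one primitive kind (hypothetical helper)."""
    records = []
    for line in log.splitlines():
        parts = line.strip().split(",")
        # Keep only lines with the 3-field prefix plus the 10 template fields;
        # this skips info lines, "unsupported" notes and exec:external lines.
        if len(parts) != len(PREFIX) + len(FIELDS) or parts[:3] != PREFIX:
            continue
        record = dict(zip(FIELDS, parts[3:]))
        if record["operation"] == "exec" and record["primitive"] == primitive:
            records.append(record)
    return records


# Lines copied from the logs above (257x768, without and with the change).
LOG_BEFORE = """\
onednn_verbose,v1,primitive,exec:external,NEMeanStdDevNormalizationKernel,2.09009
onednn_verbose,v1,primitive,exec,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,257x768,2.28711
"""
LOG_AFTER = """\
onednn_verbose,v1,primitive,exec:external,CpuMeanStdDevNormalizationKernel,2.05005
onednn_verbose,v1,primitive,exec,cpu,layer_normalization,acl,forward_inference,src:f32::blocked:ab::f0 dst:f32:a:blocked:ab::f0 stats:undef::undef:::,,flags:,257x768,2.40796
"""

t_before = float(parse_exec_lines(LOG_BEFORE, "layer_normalization")[0]["exec_time"])
t_after = float(parse_exec_lines(LOG_AFTER, "layer_normalization")[0]["exec_time"])
print(f"before={t_before} ms, after={t_after} ms, delta={t_after - t_before:+.5f} ms")
```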

@manaalmj
Author

manaalmj commented Mar 5, 2025

> Please squash your commits as well. The first one doesn't build without the fixes in the second.

Done.

@Ryo-not-rio
Contributor

Could you show the results for ./benchdnn --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb --mode=P 256x768_n"lnorm_ci_0d:2"? The runs above don't appear to report performance numbers, because --mode=P has to come before the problem descriptor.
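
To make the ordering point concrete, here is a small Python sketch that assembles a benchdnn command line with every option placed ahead of the problem descriptor. The helper `build_benchdnn_args` is hypothetical (it is not part of benchdnn), and it simply assumes that anything starting with `--` is an option.

```python
from typing import List


def build_benchdnn_args(options: List[str], problems: List[str]) -> List[str]:
    """Return an argv list with all options before the problem descriptors.

    Hypothetical helper: it only enforces the ordering discussed in this
    thread and does not validate the options themselves.
    """
    for opt in options:
        if not opt.startswith("--"):
            raise ValueError(f"not an option: {opt!r}")
    for prb in problems:
        if prb.startswith("--"):
            raise ValueError(f"option passed as a problem: {prb!r}")
    return ["./benchdnn", *options, *problems]


args = build_benchdnn_args(
    ["--lnorm", "--mode=P", "--dir=FWD_I", "--dt=f32:s8", "--tag=axb"],
    # No shell quoting is needed here because this is already an argv list.
    ["256x768_nlnorm_ci_0d:2"],
)
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args, env=...)` without going through a shell.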

@manaalmj
Author

manaalmj commented Mar 7, 2025

Perf numbers without the change:
OMP_NUM_THREADS=4 ./benchdnn --lnorm --mode=P --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_n"lnorm_ci_0d:2"
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,acl,lnorm_ci_0d:2,--mode=P --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_nlnorm_ci_0d:2,0,0.142578,0.0292969,0,0.0333144,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.0292969 avg(ms):0.0333144
total: 3.01s; fill: 0.00s (0%);

OMP_NUM_THREADS=16 ./benchdnn --lnorm --mode=P --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_n"lnorm_ci_0d:2"
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,acl,lnorm_ci_0d:2,--mode=P --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_nlnorm_ci_0d:2,0,0.13916,0.0266113,0,0.0303047,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.0266113 avg(ms):0.0303047
total: 3.01s; fill: 0.00s (0%);

Perf numbers with the change:
OMP_NUM_THREADS=4 ./benchdnn --lnorm --mode=P --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_n"lnorm_ci_0d:2"
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,acl,lnorm_ci_0d:2,--mode=P --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_nlnorm_ci_0d:2,0,0.123535,0.0288086,0,0.0339091,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.0288086 avg(ms):0.0339091
total: 3.01s; fill: 0.00s (0%);

OMP_NUM_THREADS=16 ./benchdnn --lnorm --mode=P --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_n"lnorm_ci_0d:2"
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,acl,lnorm_ci_0d:2,--mode=P --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_nlnorm_ci_0d:2,0,0.131592,0.026123,0,0.03056,0
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.026123 avg(ms):0.03056
total: 3.01s; fill: 0.00s (0%);
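
As a convenience for reading the numbers above, here is a minimal Python sketch that maps each `perf,...` line onto the keys of the printed output template and computes the relative change in time. The helpers `parse_perf_line` and `relative_change` are made up for illustration. The sketch assumes the problem string contains no commas, and it reads `%-time%` as the minimum and `%0time%` as the average time, which matches the `total perf: min(ms) ... avg(ms)` summary lines in the output.

```python
from typing import Dict

# Output template exactly as printed by benchdnn in the runs above.
TEMPLATE = ("perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,"
            "%-time%,%-Gflops%,%0time%,%0Gflops%")


def parse_perf_line(line: str, template: str = TEMPLATE) -> Dict[str, str]:
    """Map a 'perf,...' line onto the template keys (hypothetical helper)."""
    keys = [key.strip("%") for key in template.split(",")[1:]]
    values = line.strip().split(",")[1:]
    if len(keys) != len(values):
        raise ValueError("perf line does not match the output template")
    return dict(zip(keys, values))


def relative_change(before: float, after: float) -> float:
    """Fractional change; negative means the run with the change is faster."""
    return (after - before) / before


PRB = "--mode=P --lnorm --dir=FWD_I --dt=f32:s8 --tag=axb 257x768_nlnorm_ci_0d:2"
# (without change, with change) per thread count, copied from the output above.
RUNS = {
    4: (f"perf,cpu,acl,lnorm_ci_0d:2,{PRB},0,0.142578,0.0292969,0,0.0333144,0",
        f"perf,cpu,acl,lnorm_ci_0d:2,{PRB},0,0.123535,0.0288086,0,0.0339091,0"),
    16: (f"perf,cpu,acl,lnorm_ci_0d:2,{PRB},0,0.13916,0.0266113,0,0.0303047,0",
         f"perf,cpu,acl,lnorm_ci_0d:2,{PRB},0,0.131592,0.026123,0,0.03056,0"),
}

changes = {}
for nthr, (before_line, after_line) in RUNS.items():
    before, after = parse_perf_line(before_line), parse_perf_line(after_line)
    changes[nthr] = {
        key: relative_change(float(before[key]), float(after[key]))
        for key in ("-time", "0time")
    }
    print(f"nthr={nthr}: min {changes[nthr]['-time']:+.2%}, "
          f"avg {changes[nthr]['0time']:+.2%}")
```

On these four runs the minimum and average times differ by less than 2% in either direction.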

@manaalmj manaalmj force-pushed the feature2 branch 2 times, most recently from 5d12bcc to 0a4d13a Compare March 10, 2025 14:43
Contributor

@Ryo-not-rio Ryo-not-rio left a comment

Perf looks good to me.

@Ryo-not-rio Ryo-not-rio self-requested a review March 10, 2025 14:48
Contributor

@Sqvid Sqvid left a comment

Thanks for the patch. In addition to my review comments, can we take this opportunity to move all function definitions into the cpp file, please? Thank you.

@manaalmj
Author

> Thanks for the patch. In addition to my review comments, can we take this opportunity to move all function definitions into the cpp file, please? Thank you.

Done.

@manaalmj manaalmj requested review from a team as code owners March 20, 2025 14:13
@github-actions github-actions bot added the component:tests Codeowner: @oneapi-src/onednn-arch label Mar 20, 2025
@github-actions github-actions bot removed the component:tests Codeowner: @oneapi-src/onednn-arch label Mar 21, 2025