
aarch64: matmul: addition of JIT int8 kernel #2686

Merged: 1 commit into uxlfoundation:main on Mar 20, 2025

Conversation

Contributor

@Shreyas-fuj Shreyas-fuj commented Feb 13, 2025

Description

This PR introduces a quantised (int8) matmul kernel in oneDNN, which gives good speed-ups for quantised model inference on Graviton 3 CPUs.

The kernel supports:

1. Shapes: 2D, 3D, 4D
2. Zero points
   • Source: common (s32)
   • Weight: common (s32)
   • Destination: common (s32)
3. Scales
   • Source: common (f32)
   • Weight: common (f32), per_oc (f32)
   • Destination: common (f32)
4. Bias: mask 1xN (f32)
5. Dynamic quantisation of the source matrix from f32 to int8, based on a common policy for scales and zero points. This feature is enabled at runtime via the flag `ONEDNN_AARCH64_MATMUL_SRC_QUANT`:
   `export ONEDNN_AARCH64_MATMUL_SRC_QUANT=ON`
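The dynamic source-quantisation step in item 5 can be sketched as follows. This is an illustrative sketch of a common (per-tensor) scale/zero-point policy, not the actual kernel code; the names `quantize_common` and `quant_params` are hypothetical and only stand in for whatever the implementation uses internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch only (hypothetical, not the PR's kernel code):
// quantise a non-empty f32 buffer to int8 with a single "common" scale
// and zero point shared by the whole tensor.
struct quant_params { float scale; std::int32_t zp; };

quant_params quantize_common(const std::vector<float> &src,
                             std::vector<std::int8_t> &dst) {
    const auto [mn, mx] = std::minmax_element(src.begin(), src.end());
    float scale = (*mx - *mn) / 255.0f;  // map the value range onto [-128, 127]
    if (scale == 0.0f) scale = 1.0f;     // constant input; any scale works
    const std::int32_t zp = std::lround(-128.0f - *mn / scale);
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const long q = std::lround(src[i] / scale) + zp;
        dst[i] = static_cast<std::int8_t>(std::clamp<long>(q, -128, 127));
    }
    return {scale, zp};
}
```

Dequantising with `(q - zp) * scale` recovers each input to within one quantisation step, which is the usual correctness check for a common policy.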

Checklist

General

  • [y] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
make test

97% tests passed, 6 tests failed out of 219

Total Test time (real) = 4218.29 sec

The following tests FAILED:
	162 - test_graph_unit_dnnl_convolution_cpu (Failed)
	168 - test_graph_unit_dnnl_large_partition_cpu (Failed)
	191 - test_benchdnn_modeC_binary_ci_cpu (Failed)
	192 - test_benchdnn_modeC_binary_different_dt_ci_cpu (Failed)
	200 - test_benchdnn_modeC_graph_ci_cpu (Failed)
	214 - test_benchdnn_modeC_self_ci_cpu (Failed)
Errors while running CTest
Output from these tests are in: /home/shreyas/G/shr-fuj/oneDNN_open_source/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [Makefile:71: test] Error 8
  • [y] Have you formatted the code using clang-format?

@Shreyas-fuj Shreyas-fuj requested review from a team as code owners February 13, 2025 08:17
@github-actions github-actions bot added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 component:api Codeowner: @oneapi-src/onednn-arch component:tests Codeowner: @oneapi-src/onednn-arch component:build labels Feb 13, 2025
@Shreyas-fuj Shreyas-fuj changed the title cpu : aarch64 : matmul : addition of JIT int8 kernel aarch64: matmul: addition of JIT int8 kernel Feb 13, 2025
Contributor

@jondea jondea left a comment

Thanks for the contribution. I haven't looked in detail at the JIT code itself, I first want to understand the API and motivation. Also, do you have any performance numbers and related benchdnn arguments?

@@ -1,23 +1,23 @@
#===============================================================================
# Copyright 2018-2025 Intel Corporation
#== == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == =
Contributor

Could you revert these whitespace changes please? It could be that your IDE auto-formatted them without you realizing.

Contributor Author

@Shreyas-fuj Shreyas-fuj Feb 13, 2025

Sorry I didn't notice this. Done.

Contributor Author

> Could you revert these whitespace changes please? It could be that your IDE auto-formatted them without you realizing.

Sure, I will share the performance results shortly.

@Shreyas-fuj
Contributor Author

Shreyas-fuj commented Feb 13, 2025

> Thanks for the contribution. I haven't looked in detail at the JIT code itself, I first want to understand the API and motivation. Also, do you have any performance numbers and related benchdnn arguments?

Please find the performance numbers (f32 vs int8) below, taken on a 32-core Graviton 3E machine at the benchdnn level:

| Shapes | f32 | int8 | speedup |
|---|---|---|---|
| 4096x4096:4096x128256 | 1826.83 | 259.224 | 7.047302719 |
| 2048x4096:4096x128256 | 883.757 | 128.232 | 6.891860066 |
| 580x4096:4096x128256 | 251.723 | 34.3804 | 7.321700736 |
| 4096x4096:4096x14336 | 204.306 | 31.1897 | 6.550431713 |
| 4096x14336:14336x4096 | 199.682 | 29.3344 | 6.807093378 |
| 290x4096:4096x128256 | 129.629 | 16.0703 | 8.066370883 |
| 2048x14336:14336x4096 | 98.4181 | 14.115 | 6.972589444 |
| 2048x4096:4096x14336 | 99.4149 | 14.6779 | 6.773101057 |
| 145x4096:4096x128256 | 70.0352 | 8.38655 | 8.350895183 |
| 4096x4096:4096x4096 | 58.6478 | 8.47308 | 6.921662489 |
These shapes were taken from some widely used LLMs.

The f32 results were obtained using the command:
./benchdnn --matmul --mode=p MxK:KxN

For int8:
./benchdnn --matmul --dt=s8:s8:f32 --mode=p MxK:KxN

Other arguments, such as scales and zero points, can be passed to the int8 kernel as in the example below:
./benchdnn --matmul --dt=s8:s8:f32 --attr-scales=src:common:3+wei:common:4 --attr-zero-points=src:common:3+wei:common:2 85x324:324x243
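As a sanity check on the table above, the speedup column is simply the f32 time divided by the int8 time. A minimal sketch using the values copied from the first row (units as reported by `benchdnn --mode=p`):

```cpp
#include <cassert>
#include <cmath>

// Values copied from the first row of the table (4096x4096:4096x128256).
constexpr double f32_time = 1826.83;
constexpr double int8_time = 259.224;
// speedup = f32 time / int8 time, ~7.0473x as reported in the table
constexpr double speedup = f32_time / int8_time;
```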

@vpirogov
Contributor

Tagging @dmitry-gorokhov for OpenVINO

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch 2 times, most recently from e78e29f to 8849077 Compare February 25, 2025 10:40
@dzarukin
Contributor

The code is very raw; I have a high level of confidence it would fail on lots of benchdnn cases.

@Shreyas-fuj
Contributor Author

> The code is very raw; I have a high level of confidence it would fail on lots of benchdnn cases.

Hi @dzarukin, I will look into this. However, there are no failures in the benchdnn test cases; this was verified before raising the PR.

@dzarukin
Contributor

dzarukin commented Feb 26, 2025

> The code is very raw; I have a high level of confidence it would fail on lots of benchdnn cases.
>
> Hi @dzarukin, I will look into this. However, there are no failures in the benchdnn test cases; this was verified before raising the PR.

Sorry, that was a quite premature statement. Let me put it this way:

  • Unless I'm really missing something, I would expect to see issues around memory formats because of the way they are used to initialize memory descriptors, especially transposed tags for src and weights. That's my primary concern.
  • It looks like the PR is not rebased on top of the latest main, which was a second source of my conclusion: with the existing change, the rebased version wouldn't work due to how quantization is used/checked.
  • There will be clang-tidy issues raised. I marked some of them, but it may find additional ones on top; it would be good to address that before promotion.
  • Styling is minor but still good to follow.
  • Verbose support is also minor but really helpful for saving time when debugging.
  • Personal observation: the existing approach follows a data-centric class style, which has its advantages, but the more it grows the harder it is to keep accurate. Changing a value on a class to specify which reorder, A or B, to initialize might trip somebody up at some point. Better be careful with such things, because the deeper the hole goes, the harder it is to debug.
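The last observation can be illustrated with a minimal sketch (hypothetical types, not oneDNN code): a scoped enum makes the choice of which matrix to reorder explicit and type-checked, where a bare flag value buried in a data-centric class is easy to set incorrectly.

```cpp
#include <cassert>

// Hypothetical sketch, not oneDNN code: the "which matrix to reorder"
// choice is a scoped enum, so it cannot be silently confused with any
// other integer field on the configuration class.
enum class reorder_target { matrix_a, matrix_b };

struct reorder_config {
    reorder_target target; // explicit, type-checked selection
};
```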

@Shreyas-fuj
Contributor Author

> Sorry, that was a quite premature statement. Let me put it this way: […]

Thanks for the clarification, @dzarukin. Yes, I will rebase the code onto main, make the changes, test, and push again.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch 2 times, most recently from 844fa49 to 139b4b2 Compare February 26, 2025 10:50
@Shreyas-fuj
Contributor Author

@dzarukin Thanks for the extensive review; your comments and suggestions are highly appreciated. I have made all the suggested changes; please have a look.

@Shreyas-fuj
Contributor Author

@vpirogov @dzarukin @Radu2k @jondea , please let me know if there are any other changes required for approval. Thanks.

Contributor

@dzarukin dzarukin left a comment

The last big piece I have is header dependencies for the list file; once that's addressed, it's good to go from my side. Thank you.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch 3 times, most recently from e77d3db to 9344e0f Compare March 12, 2025 04:55
@Shreyas-fuj
Contributor Author

I have just rebased and force-pushed the commit, but the clang-tidy check is failing with this error:

Getting Git version info
Temporarily overriding HOME='/home/runner/work/_temp/b589a90e-5114-4043-bb40-2090bc2bf684' before making global git config changes
Adding repository directory to the temporary git global config as a safe directory
/usr/bin/git config --global --add safe.directory /home/runner/work/oneDNN/oneDNN
Deleting the contents of '/home/runner/work/oneDNN/oneDNN'
Initializing the repository
Disabling automatic garbage collection
Setting up auth
Fetching the repository
Determining the checkout info
  /usr/bin/git branch --list --remote origin/aarch64_matmul_int8
  /usr/bin/git tag --list aarch64_matmul_int8
  Error: A branch or tag with the name 'aarch64_matmul_int8' could not be found

Any idea why this is happening?

@dzarukin
Contributor

> I have just rebased and force-pushed the commit, but the clang-tidy check is failing with this error: […]
>
> Any idea why this is happening?

Can the fork be a reason? I'm not totally sure if the linter is designed to work with forks.

@Shreyas-fuj
Contributor Author

> I have just rebased and force-pushed the commit, but the clang-tidy check is failing with this error: […]
>
> Can the fork be a reason? I'm not totally sure if the linter is designed to work with forks.

I also suspect a linter issue: I checked the tracking info in my local repo, and aarch64_matmul_int8 is tracking the correct upstream branch.

@vpirogov
Contributor

> Any idea why this is happening?

Linter is currently broken for forks. Will be fixed by #2859. It should not block the PR though.

@atkassen
Contributor

> Linter is currently broken for forks. Will be fixed by #2859. It should not block the PR though.

I've just merged the fix, try rebasing again.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch from 9344e0f to 118103c Compare March 13, 2025 04:47
@Shreyas-fuj Shreyas-fuj requested review from a team as code owners March 13, 2025 04:47
@Shreyas-fuj
Contributor Author

> Linter is currently broken for forks. Will be fixed by #2859. It should not block the PR though.
>
> I've just merged the fix, try rebasing again.

Thanks! I have rebased, and it seems to be working fine now.

Contributor

@jondea jondea left a comment

You've addressed all my comments, thank you! I do have one (non-blocking) question though: is this vector-length agnostic? Not that it needs to be, but I can't see a reference to anything vector-length specific.

@Shreyas-fuj
Contributor Author

> You've addressed all my comments, thank you! I do have one (non-blocking) question though: is this vector-length agnostic? Not that it needs to be, but I can't see a reference to anything vector-length specific.

@jondea, thanks for the approval. Right now it is written only for Graviton 3 CPUs, but it can easily be extended to other vector lengths in the future by adjusting the block sizes.

@Shreyas-fuj
Contributor Author

Hi @vpirogov, is there anything to be done from our side that is blocking the merge of this PR?

@jondea
Contributor

jondea commented Mar 19, 2025

> Right now it is written only for Graviton 3 CPUs, but it can easily be extended to other vector lengths in the future by adjusting the block sizes.

That's no problem. The reason I asked is that I couldn't see any dispatch logic to protect other ISAs; I had missed this line because I was searching for 128/256/512:

VDISPATCH_MATMUL(get_sve_length() == 32, VERBOSE_UNSUPPORTED_ISA);

But that looks okay, thanks!
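The dispatch check quoted above can be read as follows. `get_sve_length()` below is a hypothetical stub standing in for the real oneDNN helper, assumed to return the SVE register length in bytes, so 32 bytes corresponds to the 256-bit SVE of Graviton 3:

```cpp
#include <cassert>

// Hypothetical stub for illustration only: assumed to report the SVE
// register length in bytes (32 bytes == 256-bit SVE, as on Graviton 3).
static int get_sve_length() { return 32; }

// Mirrors the VDISPATCH_MATMUL condition: the int8 kernel is dispatched
// only when the vector length is exactly 256 bits; any other length
// (128-bit, 512-bit, ...) falls through to another implementation.
inline bool int8_kernel_supported() { return get_sve_length() == 32; }
```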

@vpirogov
Contributor

@Shreyas-fuj, the PR is good to go. Thank you for the contribution!

@vpirogov
Contributor

@Shreyas-fuj, there are a couple of comments from @dzarukin that should be addressed before promotion.

@Shreyas-fuj
Contributor Author

> @Shreyas-fuj, there are a couple of comments from @dzarukin that should be addressed before promotion.

Hi @vpirogov, the second comment was not insisted on, and I had missed the first comment; thanks.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch from 118103c to 7859752 Compare March 20, 2025 04:59
@Shreyas-fuj
Contributor Author

@vpirogov, I have made the EOL changes you suggested. Thanks.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch from 7859752 to d54c72a Compare March 20, 2025 06:39
@Shreyas-fuj
Contributor Author

Hi @vpirogov, I have addressed all the comments. Thanks!

@vpirogov vpirogov merged commit bdcdab8 into uxlfoundation:main Mar 20, 2025
21 checks passed