
aarch64: matmul: addition of JIT int8 kernel #2686

Merged: 1 commit into uxlfoundation:main on Mar 20, 2025

Conversation

Contributor

@Shreyas-fuj Shreyas-fuj commented Feb 13, 2025

Description

This PR introduces a quantised (int8) matmul kernel in oneDNN, which gives good speed-ups for quantised model inference on Graviton 3 CPUs.

The kernel supports:

1. Shapes: 2D, 3D, 4D
2. Zero points
   • Source: common (s32)
   • Weight: common (s32)
   • Destination: common (s32)
3. Scales
   • Source: common (f32)
   • Weight: common (f32), per_oc (f32)
   • Destination: common (f32)
4. Bias: mask 1xN (f32)
5. Dynamic quantisation of the source matrix from f32 to int8, based on a common policy for scales and zero points. This feature is enabled at runtime via the flag `ONEDNN_AARCH64_MATMUL_SRC_QUANT`:
   `export ONEDNN_AARCH64_MATMUL_SRC_QUANT=ON`
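The dynamic source-quantisation step in item 5 can be sketched as follows. This is an illustrative sketch of a common (per-tensor) scale/zero-point policy, not the actual kernel code; the names `quantize_common` and `quant_params` are hypothetical and only stand in for whatever the implementation uses internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch only (hypothetical, not the PR's kernel code):
// quantise a non-empty f32 buffer to int8 with a single "common" scale
// and zero point shared by the whole tensor.
struct quant_params { float scale; std::int32_t zp; };

quant_params quantize_common(const std::vector<float> &src,
                             std::vector<std::int8_t> &dst) {
    const auto [mn, mx] = std::minmax_element(src.begin(), src.end());
    float scale = (*mx - *mn) / 255.0f;  // map the value range onto [-128, 127]
    if (scale == 0.0f) scale = 1.0f;     // constant input; any scale works
    const std::int32_t zp = std::lround(-128.0f - *mn / scale);
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const long q = std::lround(src[i] / scale) + zp;
        dst[i] = static_cast<std::int8_t>(std::clamp<long>(q, -128, 127));
    }
    return {scale, zp};
}
```

Dequantising with `(q - zp) * scale` recovers each input to within one quantisation step, which is the usual correctness check for a common policy.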

Checklist

General

  • [y] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
make test

97% tests passed, 6 tests failed out of 219

Total Test time (real) = 4218.29 sec

The following tests FAILED:
	162 - test_graph_unit_dnnl_convolution_cpu (Failed)
	168 - test_graph_unit_dnnl_large_partition_cpu (Failed)
	191 - test_benchdnn_modeC_binary_ci_cpu (Failed)
	192 - test_benchdnn_modeC_binary_different_dt_ci_cpu (Failed)
	200 - test_benchdnn_modeC_graph_ci_cpu (Failed)
	214 - test_benchdnn_modeC_self_ci_cpu (Failed)
Errors while running CTest
Output from these tests are in: /home/shreyas/G/shr-fuj/oneDNN_open_source/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [Makefile:71: test] Error 8
  • [y] Have you formatted the code using clang-format?

@Shreyas-fuj Shreyas-fuj requested review from a team as code owners February 13, 2025 08:17
@github-actions github-actions bot added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 component:api Codeowner: @oneapi-src/onednn-arch component:tests Codeowner: @oneapi-src/onednn-arch component:build labels Feb 13, 2025
@Shreyas-fuj Shreyas-fuj changed the title cpu : aarch64 : matmul : addition of JIT int8 kernel aarch64: matmul: addition of JIT int8 kernel Feb 13, 2025
Contributor

@jondea jondea left a comment

Thanks for the contribution. I haven't looked in detail at the JIT code itself, I first want to understand the API and motivation. Also, do you have any performance numbers and related benchdnn arguments?

@@ -1,23 +1,23 @@
#===============================================================================
# Copyright 2018-2025 Intel Corporation
#== == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == == =
Contributor

Could you revert these whitespace changes please? It could be that your IDE auto-formatted them without you realizing.

Contributor Author

@Shreyas-fuj Shreyas-fuj Feb 13, 2025

Sorry I didn't notice this. Done.

Contributor Author

> Could you revert these whitespace changes please? It could be that your IDE auto-formatted them without you realizing.

Sure, I will share the performance results shortly.

@Shreyas-fuj
Contributor Author

Shreyas-fuj commented Feb 13, 2025

> Thanks for the contribution. I haven't looked in detail at the JIT code itself, I first want to understand the API and motivation. Also, do you have any performance numbers and related benchdnn arguments?

Please find the performance numbers (f32 vs int8) below, taken on a 32-core Graviton 3E machine at the benchdnn level:

| Shapes | f32 | int8 | speedup |
|---|---|---|---|
| 4096x4096:4096x128256 | 1826.83 | 259.224 | 7.047302719 |
| 2048x4096:4096x128256 | 883.757 | 128.232 | 6.891860066 |
| 580x4096:4096x128256 | 251.723 | 34.3804 | 7.321700736 |
| 4096x4096:4096x14336 | 204.306 | 31.1897 | 6.550431713 |
| 4096x14336:14336x4096 | 199.682 | 29.3344 | 6.807093378 |
| 290x4096:4096x128256 | 129.629 | 16.0703 | 8.066370883 |
| 2048x14336:14336x4096 | 98.4181 | 14.115 | 6.972589444 |
| 2048x4096:4096x14336 | 99.4149 | 14.6779 | 6.773101057 |
| 145x4096:4096x128256 | 70.0352 | 8.38655 | 8.350895183 |
| 4096x4096:4096x4096 | 58.6478 | 8.47308 | 6.921662489 |
These shapes were taken from some widely used LLMs.

The f32 results were obtained using the command:
./benchdnn --matmul --mode=p MxK:KxN

For int8:
./benchdnn --matmul --dt=s8:s8:f32 --mode=p MxK:KxN

Other arguments, such as scales and zero points, can be passed to the int8 kernel as in the example below:
./benchdnn --matmul --dt=s8:s8:f32 --attr-scales=src:common:3+wei:common:4 --attr-zero-points=src:common:3+wei:common:2 85x324:324x243
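As a sanity check on the table above, the speedup column is simply the f32 time divided by the int8 time. A minimal sketch using the values copied from the first row (units as reported by `benchdnn --mode=p`):

```cpp
#include <cassert>
#include <cmath>

// Values copied from the first row of the table (4096x4096:4096x128256).
constexpr double f32_time = 1826.83;
constexpr double int8_time = 259.224;
// speedup = f32 time / int8 time, ~7.0473x as reported in the table
constexpr double speedup = f32_time / int8_time;
```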

@vpirogov
Contributor

Tagging @dmitry-gorokhov for OpenVINO

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch 2 times, most recently from e78e29f to 8849077 Compare February 25, 2025 10:40
@dzarukin
Contributor

The code is very raw; I have a high level of confidence it would fail on lots of benchdnn cases.

@Shreyas-fuj
Contributor Author

> The code is very raw; I have a high level of confidence it would fail on lots of benchdnn cases.

Hi @dzarukin, I will look into this. However, there are no failures in the benchdnn test cases; this was verified before raising the PR.

@dzarukin
Contributor

dzarukin commented Feb 26, 2025

> The code is very raw; I have a high level of confidence it would fail on lots of benchdnn cases.
>
> Hi @dzarukin, I will look into this. However, there are no failures in the benchdnn test cases; this was verified before raising the PR.

Sorry, that was a quite premature statement. Let me put it this way:

  • Unless I'm really missing something, I would expect to see issues around memory formats because of the way they are used to initialize memory descriptors, especially transposed tags for src and weights. That's my primary concern.
  • It looks like the PR is not rebased on top of the latest main, which was a second source of my conclusion: with the existing change, the rebased version wouldn't work due to how quantization is used/checked.
  • There will be clang-tidy issues raised. I marked some of them, but it may find additional ones on top; it would be good to address that before promotion.
  • Styling is minor but still good to follow.
  • Verbose support is also minor but really helpful for saving time when debugging.
  • Personal observation: the existing approach follows a data-centric class style, which has its advantages, but the more it grows the harder it is to keep accurate. Changing a value on a class to specify which reorder, A or B, to initialize might trip somebody up at some point. Better be careful with such things, because the deeper the hole goes, the harder it is to debug.
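The last observation can be illustrated with a minimal sketch (hypothetical types, not oneDNN code): a scoped enum makes the choice of which matrix to reorder explicit and type-checked, where a bare flag value buried in a data-centric class is easy to set incorrectly.

```cpp
#include <cassert>

// Hypothetical sketch, not oneDNN code: the "which matrix to reorder"
// choice is a scoped enum, so it cannot be silently confused with any
// other integer field on the configuration class.
enum class reorder_target { matrix_a, matrix_b };

struct reorder_config {
    reorder_target target; // explicit, type-checked selection
};
```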

@Shreyas-fuj
Contributor Author

> Sorry, that was a quite premature statement. Let me put it this way: […]

Thanks for the clarification, @dzarukin. Yes, I will rebase the code onto main, make the changes, test, and push again.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch 2 times, most recently from 844fa49 to 139b4b2 Compare February 26, 2025 10:50
@Shreyas-fuj
Contributor Author

@dzarukin Thanks for the extensive review; your comments and suggestions are highly appreciated. I have made all the suggested changes; please have a look.

@Shreyas-fuj
Contributor Author

@vpirogov @dzarukin @Radu2k @jondea , please let me know if there are any other changes required for approval. Thanks.

Contributor

@dzarukin dzarukin left a comment

The last big piece I have is header dependencies for the list file; once that's addressed, it's good to go from my side. Thank you.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch 3 times, most recently from e77d3db to 9344e0f Compare March 12, 2025 04:55
@Shreyas-fuj
Contributor Author

I have just rebased and force-pushed the commit, but the clang-tidy check is failing with this error:

Getting Git version info
Temporarily overriding HOME='/home/runner/work/_temp/b589a90e-5114-4043-bb40-2090bc2bf684' before making global git config changes
Adding repository directory to the temporary git global config as a safe directory
/usr/bin/git config --global --add safe.directory /home/runner/work/oneDNN/oneDNN
Deleting the contents of '/home/runner/work/oneDNN/oneDNN'
Initializing the repository
Disabling automatic garbage collection
Setting up auth
Fetching the repository
Determining the checkout info
  /usr/bin/git branch --list --remote origin/aarch64_matmul_int8
  /usr/bin/git tag --list aarch64_matmul_int8
  Error: A branch or tag with the name 'aarch64_matmul_int8' could not be found

Any idea why this is happening?

@dzarukin
Contributor

> I have just rebased and force-pushed the commit, but the clang-tidy check is failing with this error: […]
>
> Any idea why this is happening?

Can the fork be a reason? I'm not totally sure if the linter is designed to work with forks.

@Shreyas-fuj
Contributor Author

> I have just rebased and force-pushed the commit, but the clang-tidy check is failing with this error: […]
>
> Can the fork be a reason? I'm not totally sure if the linter is designed to work with forks.

I also suspect a linter issue: I checked the tracking info in my local repo, and aarch64_matmul_int8 is tracking the correct upstream branch.

@vpirogov
Contributor

> Any idea why this is happening?

Linter is currently broken for forks. Will be fixed by #2859. It should not block the PR though.

@atkassen
Contributor

> Linter is currently broken for forks. Will be fixed by #2859. It should not block the PR though.

I've just merged the fix, try rebasing again.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch from 9344e0f to 118103c Compare March 13, 2025 04:47
@Shreyas-fuj Shreyas-fuj requested review from a team as code owners March 13, 2025 04:47
@Shreyas-fuj
Contributor Author

> Linter is currently broken for forks. Will be fixed by #2859. It should not block the PR though.
>
> I've just merged the fix, try rebasing again.

Thanks! I have rebased, and it seems to be working fine now.

Contributor

@jondea jondea left a comment

You've addressed all my comments, thank you! I do have one (non-blocking) question though: is this vector-length agnostic? Not that it needs to be, but I can't see a reference to anything vector-length specific.

@Shreyas-fuj
Contributor Author

> You've addressed all my comments, thank you! I do have one (non-blocking) question though: is this vector-length agnostic? Not that it needs to be, but I can't see a reference to anything vector-length specific.

@jondea, thanks for the approval. Right now it is written only for Graviton 3 CPUs, but it can easily be extended to other vector lengths in the future by adjusting the block sizes.

@Shreyas-fuj
Contributor Author

Hi @vpirogov, is there anything to be done from our side that is blocking the merge of this PR?

@jondea
Contributor

jondea commented Mar 19, 2025

> Right now it is written only for Graviton 3 CPUs, but it can easily be extended to other vector lengths in the future by adjusting the block sizes.

That's no problem. The reason I asked is that I couldn't see any dispatch logic to protect other ISAs; I had missed this line because I was searching for 128/256/512:

VDISPATCH_MATMUL(get_sve_length() == 32, VERBOSE_UNSUPPORTED_ISA);

But that looks okay, thanks!
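The dispatch check quoted above can be read as follows. `get_sve_length()` below is a hypothetical stub standing in for the real oneDNN helper, assumed to return the SVE register length in bytes, so 32 bytes corresponds to the 256-bit SVE of Graviton 3:

```cpp
#include <cassert>

// Hypothetical stub for illustration only: assumed to report the SVE
// register length in bytes (32 bytes == 256-bit SVE, as on Graviton 3).
static int get_sve_length() { return 32; }

// Mirrors the VDISPATCH_MATMUL condition: the int8 kernel is dispatched
// only when the vector length is exactly 256 bits; any other length
// (128-bit, 512-bit, ...) falls through to another implementation.
inline bool int8_kernel_supported() { return get_sve_length() == 32; }
```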

@vpirogov
Contributor

@Shreyas-fuj, the PR is good to go. Thank you for the contribution!

@vpirogov
Contributor

@Shreyas-fuj, there are a couple of comments from @dzarukin that should be addressed before promotion.

@Shreyas-fuj
Contributor Author

> @Shreyas-fuj, there are a couple of comments from @dzarukin that should be addressed before promotion.

Hi @vpirogov, the second comment was not insisted on, and I had missed the first comment; thanks.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch from 118103c to 7859752 Compare March 20, 2025 04:59
@Shreyas-fuj
Contributor Author

@vpirogov, I have made the EOL changes you suggested. Thanks.

@Shreyas-fuj Shreyas-fuj force-pushed the aarch64_matmul_int8 branch from 7859752 to d54c72a Compare March 20, 2025 06:39
@Shreyas-fuj
Contributor Author

Hi @vpirogov, I have addressed all the comments. Thanks!

@vpirogov vpirogov merged commit bdcdab8 into uxlfoundation:main Mar 20, 2025
21 checks passed