xe: sdpa: improve performance of quantized sdpa with a head size of 64 #2921

Merged
merged 2 commits into from
Mar 21, 2025

@umar456 umar456 commented Mar 19, 2025

Description

This PR updates the SDPA kernel configurations to optimize for a head size of 64. These updates improve performance for this small head size by 1.1-1.5x on LNL. Performance on other platforms will be posted soon.

| mb |  N |  D |   KV |    Q | kdt       | vdt       | mask   | quant                 | sdpa(main) | sdpa(PR) | speedup vs. main |
|----+----+----+------+------+-----------+-----------+--------+-----------------------+------------+----------+------------------|
|  1 | 32 | 64 |  385 |    1 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |      80.75 |     62.5 |            1.292 |
|  1 | 32 | 64 |  513 |    1 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |      91.96 |    70.78 |        1.2992371 |
|  1 | 32 | 64 | 1025 |    1 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |     122.32 |   101.19 |        1.2088151 |
|  1 | 32 | 64 | 2049 |    1 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |     216.77 |   154.16 |        1.4061365 |
|  1 | 32 | 64 | 4097 |    1 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |     398.24 |    257.9 |        1.5441644 |
|----+----+----+------+------+-----------+-----------+--------+-----------------------+------------+----------+------------------|
|  1 | 32 | 64 |  384 |  384 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |     402.67 |   299.89 |        1.3427257 |
|  1 | 32 | 64 |  512 |  512 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |     512.16 |   458.93 |        1.1159872 |
|  1 | 32 | 64 | 1024 | 1024 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |    1801.07 |  1616.98 |        1.1138480 |
|  1 | 32 | 64 | 2048 | 2048 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |    9421.45 |  6120.72 |        1.5392715 |
|  1 | 32 | 64 | 4096 | 4096 | s8/f16/na | s8/f16/na | causal | per_token_with_groups |    26456.8 |  23432.6 |        1.1290595 |
#+TBLFM: $12=$10/$11
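
As a quick sanity check on the speedup column, the sketch below recomputes `sdpa(main) / sdpa(PR)` for each row and reports the range. The timings are copied from the table above; the PR does not state their units, so the script treats them as plain numbers. The `(KV, Q)` keys and variable names are illustrative only and are not part of the PR.

```python
# Timings copied from the table above as (sdpa(main), sdpa(PR)).
# Units are not stated in the PR; only the ratio matters here.
# Keys are (KV, Q) sequence lengths; all rows use mb=1, N=32, D=64.
timings = {
    (385, 1): (80.75, 62.5),
    (513, 1): (91.96, 70.78),
    (1025, 1): (122.32, 101.19),
    (2049, 1): (216.77, 154.16),
    (4097, 1): (398.24, 257.9),
    (384, 384): (402.67, 299.89),
    (512, 512): (512.16, 458.93),
    (1024, 1024): (1801.07, 1616.98),
    (2048, 2048): (9421.45, 6120.72),
    (4096, 4096): (26456.8, 23432.6),
}

# Same formula as the table's "#+TBLFM: $12=$10/$11" line: main / PR.
speedups = {shape: main / pr for shape, (main, pr) in timings.items()}

lowest = min(speedups.values())
highest = max(speedups.values())

for (kv, q), s in speedups.items():
    print(f"KV={kv:5d} Q={q:5d} speedup={s:.3f}")
print(f"range: {lowest:.2f}x to {highest:.2f}x")
```

The computed range is consistent with the 1.1-1.5x figure quoted in the description.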

Addresses: MFDNN-11755

This PR also refactors the checks for the input descriptors so that they are uniform between the internal primitive and the Graph API.

@umar456 umar456 added performance platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel labels Mar 19, 2025
@umar456 umar456 requested review from a team as code owners March 19, 2025 20:21
@umar456 umar456 force-pushed the uarshad/more_sdpa_configs branch from 4599fbd to ce45811 Compare March 20, 2025 15:07
umar456 commented Mar 20, 2025

make test
disable test_device_cpu
disable build_cpu_runtime_omp
disable build_cpu_runtime_sycl
disable build_cpu_runtime_tbb
disable benchdnn_all
enable benchdnn_graph
enable test_device_gpu
enable arch_gpu_xe-hpc
enable arch_gpu_xe-hpg-atsm
enable arch_gpu_xe-hpg-dg2
enable arch_gpu_xe-lp
enable arch_gpu_xe-lpg
enable arch_gpu_xe-lpg+
enable arch_gpu_xe2-hpg-bmg
enable arch_gpu_xe2-lpg

@umar456 umar456 merged commit 383a3fb into main Mar 21, 2025
15 of 17 checks passed
@umar456 umar456 deleted the uarshad/more_sdpa_configs branch March 21, 2025 15:35