concat2 internal padding #2882

Open
wants to merge 7 commits into
base: main

syurkevi
Contributor

Description

This PR improves concat performance for blocked layouts with internal padding: cases such as aBc32b:aBc32b 2x33x2:2x32x2, where a source has padding inside the concat dimension that must be removed. A new specialized kernel achieves better bandwidth utilization by attempting to form fully utilized write cache lines from overlapping aligned reads.
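To make the padding problem concrete, here is a minimal sketch (plain Python, not the kernel from this PR) of how destination blocks along the concat dimension are assembled from source blocks. The helper names `padded` and `concat_plan` are hypothetical and exist only for illustration; the sketch assumes a single blocked dimension with block size 32, as in the aBc32b example above.

```python
def padded(n, block):
    """Round n up to a multiple of block (the size actually stored in memory)."""
    return -(-n // block) * block


def concat_plan(src_channels, block=32):
    """For each destination block, list the source segments that fill it.

    Each segment is a tuple:
    (src index, src block, offset in src block, offset in dst block, length).
    """
    total = sum(src_channels)
    plan = [[] for _ in range(padded(total, block) // block)]
    dst = 0  # running logical offset along the concat dimension
    for s, n in enumerate(src_channels):
        c = 0  # logical offset inside the current source
        while c < n:
            # Copy up to the next source or destination block boundary.
            length = min(block - c % block, block - dst % block, n - c)
            plan[dst // block].append(
                (s, c // block, c % block, dst % block, length))
            c += length
            dst += length
    return plan


# The 33 + 32 case from the description: the first source is stored as
# 64 elements (31 of them padding), so the second source lands at an
# offset of 33, which is not block-aligned in the destination.
for i, segments in enumerate(concat_plan([33, 32])):
    print(i, segments)
```

In this sketch the middle destination block is filled by one element from the first source and 31 from the second, which is the kind of write the description refers to when it talks about forming full write cache lines from overlapping reads.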

Currently, the kernel is specialized for n=2 inputs only and has some restrictions with respect to SIMD size. The change should address a large portion of the problem layers identified in issue MFDNN-9174.

Since the approach is fairly generic, a geomean speedup of roughly 30% is observed across multiple platforms. Peak speedup on certain problem layers reaches ~70%.
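For readers unfamiliar with the metric, a geomean speedup can be computed as sketched below. The timings are made-up placeholders chosen for illustration, not measurements from this PR, and the `geomean` helper is a hypothetical name.

```python
import math


def geomean(values):
    """Geometric mean: exp of the arithmetic mean of the logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (old_time, new_time) pairs per layer; NOT data from this PR.
timings = [(100.0, 80.0), (50.0, 40.0), (200.0, 100.0)]

# Speedup per layer is old time divided by new time.
speedups = [old / new for old, new in timings]

print(f"geomean speedup: {geomean(speedups):.3f}x")
```

The geometric mean is the usual choice for averaging ratios because a single outlier layer skews it less than it would an arithmetic mean.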

The kernel has been kept within the reusable_simple_concat implementation. This may need to be revisited to determine the optimal code organization.

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

@syurkevi syurkevi requested review from a team as code owners March 14, 2025 01:56
@github-actions github-actions bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:tests Codeowner: @oneapi-src/onednn-arch labels Mar 14, 2025
@syurkevi
Contributor Author

make test perf-gpu
set primitive=concat

@syurkevi syurkevi force-pushed the syurkevi/concat_inner_padding branch from f76cfd3 to 22293cc Compare March 14, 2025 02:10

--stag=ABcd32a32b:ABcd32a32b
--dtag=ABcd32a32b
4x33x8x8:4x32x8x8
Contributor

That's too much to validate a pretty narrow new use case.
Please reduce and try to incorporate into existing files.
Besides that, this file won't be picked up for GPU validation.

Additionally, it would be nice to understand where the blocked layout comes from because, AFAIK, activations must move to the nxc format for the whole model starting with Xe2.

Contributor Author

Reduced the number of tests and incorporated into gpu validation. Do let me know if there are still too many.

There has been some back and forth on the necessity of block support. There were talks that it will still be needed on future platforms, but I don't recall the details of which customer. I would like to close this narrow-case performance gap regardless and will revisit with more improvements (e.g. 3+ inputs, large block sizes with small SIMD, flexible conf, etc.) if there is demand again.

@syurkevi syurkevi force-pushed the syurkevi/concat_inner_padding branch 2 times, most recently from 7976502 to 8f69bfd Compare March 14, 2025 03:13
@syurkevi
Contributor Author

make test
disable test_device_cpu
disable benchdnn_all
enable benchdnn_concat

@atkassen
Contributor

make test perf-gpu
set primitive=concat

@syurkevi
Contributor Author

make test perf-gpu
set primitive=concat

@syurkevi syurkevi force-pushed the syurkevi/concat_inner_padding branch from fdde2b5 to 679791a Compare March 20, 2025 01:01
@syurkevi
Contributor Author

make test perf-gpu
set primitive=concat

@syurkevi syurkevi force-pushed the syurkevi/concat_inner_padding branch from 679791a to 9bc2d32 Compare March 20, 2025 01:53
@syurkevi
Contributor Author

make test perf-gpu
set primitive=concat

@syurkevi syurkevi force-pushed the syurkevi/concat_inner_padding branch from 9bc2d32 to c6f494a Compare March 21, 2025 20:57
Labels
component:tests Codeowner: @oneapi-src/onednn-arch platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel
3 participants