concat2 internal padding #2882

syurkevi · 2025-03-14T01:56:48Z

Description

This PR improves concat performance for blocked layouts with internal padding, for example aBc32b:aBc32b 2x33x2:2x32x2, where a source carries padding inside the concat dimension that must be removed. A new specialized kernel improves bandwidth utilization by attempting to form fully utilized write cache lines from overlapping aligned reads.
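
To make the problem case concrete, here is a minimal host-side reference sketch in Python (not the kernel from this PR). It assumes the usual reading of the aBc32b tag, where dimension b is split into blocks of 32 and the block is the innermost axis; the helper names `padded`, `offset_aBc32b`, `pack`, and `concat_b` are hypothetical and exist only for this illustration.

```python
BLOCK = 32  # inner block size of the "32b" part of the aBc32b tag


def padded(n, block=BLOCK):
    """Round n up to a multiple of the block size."""
    return -(-n // block) * block


def offset_aBc32b(a, b, c, dims, block=BLOCK):
    """Physical offset of logical element (a, b, c) in an aBc32b buffer."""
    _, nb, nc = dims
    n_blocks = padded(nb, block) // block
    return ((a * n_blocks + b // block) * nc + c) * block + b % block


def pack(dims, value_of, block=BLOCK):
    """Build a zero-padded aBc32b buffer from a function of (a, b, c)."""
    na, nb, nc = dims
    buf = [0] * (na * padded(nb, block) * nc)
    for a in range(na):
        for b in range(nb):
            for c in range(nc):
                buf[offset_aBc32b(a, b, c, dims, block)] = value_of(a, b, c)
    return buf


def concat_b(srcs, src_dims, block=BLOCK):
    """Reference concat along b: copy element by element, dropping source padding."""
    na, _, nc = src_dims[0]
    dst_dims = (na, sum(d[1] for d in src_dims), nc)
    dst = [0] * (na * padded(dst_dims[1], block) * nc)
    base = 0  # logical b offset of the current source inside dst
    for buf, dims in zip(srcs, src_dims):
        for a in range(na):
            for b in range(dims[1]):
                for c in range(nc):
                    dst_off = offset_aBc32b(a, base + b, c, dst_dims, block)
                    dst[dst_off] = buf[offset_aBc32b(a, b, c, dims, block)]
        base += dims[1]
    return dst, dst_dims


# The example shapes from the description: 2x33x2 and 2x32x2.
d0, d1 = (2, 33, 2), (2, 32, 2)
s0 = pack(d0, lambda a, b, c: 1000 + (a * 33 + b) * 2 + c)
s1 = pack(d1, lambda a, b, c: 2000 + (a * 32 + b) * 2 + c)
dst, dst_dims = concat_b([s0, s1], [d0, d1])
```

With these shapes, the first source stores 33 channels in 64 slots, so 31 lanes of its second block are padding. In the destination, the second source starts at lane 1 of block 1 instead of on a block boundary, which is why a plain block-to-block copy does not work.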

Currently, the kernel is specialized to n=2 inputs only and has some restrictions with respect to SIMD size. The change should address a large portion of the problem layers identified in issue MFDNN-9174.
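
The idea of building full, aligned destination blocks from two neighbouring aligned source blocks can be sketched for the n=2 case as follows. This is a simplified one-dimensional Python illustration, not the OpenCL kernel: it treats each input as a list of aligned blocks along the concat dimension, assumes a block size of 32, and uses hypothetical helper names (`to_blocks`, `concat2_blocks`).

```python
def to_blocks(values, block=32, pad=0):
    """Split a logical vector into aligned blocks, padding the last one."""
    n_blocks = -(-len(values) // block)
    flat = list(values) + [pad] * (n_blocks * block - len(values))
    return [flat[i * block:(i + 1) * block] for i in range(n_blocks)]


def concat2_blocks(src0, n0, src1, n1, block=32, pad=0):
    """Concat two blocked vectors; every dst block is written whole.

    src0/src1 are lists of aligned blocks holding n0/n1 valid elements.
    Each dst block is assembled from at most two aligned source blocks.
    """
    shift = n0 % block   # lane at which src1 starts inside a dst block
    full0 = n0 // block  # number of completely filled src0 blocks
    total = n0 + n1
    n_dst = -(-total // block)
    dst = []
    for k in range(n_dst):
        if k < full0:
            dst.append(list(src0[k]))  # aligned block, plain copy
            continue
        j = k - full0  # index of the src1 block that supplies the tail lanes
        if j == 0:
            head = src0[full0][:shift] if shift else []  # valid tail of src0
        else:
            head = src1[j - 1][block - shift:]  # lanes left over from the previous read
        tail = src1[j][:block - shift] if j < len(src1) else [pad] * (block - shift)
        dst.append(head + tail)
    # Reset lanes past the logical end so dst padding does not depend on src padding.
    for i in range(total, n_dst * block):
        dst[i // block][i % block] = pad
    return dst


# 33 + 32 elements, matching the concat dimension of the example in the description.
a = list(range(1, 34))
b = list(range(101, 133))
out = concat2_blocks(to_blocks(a), len(a), to_blocks(b), len(b))
```

Here `out` has three blocks: the first is a straight copy of the first block of `a`, the second combines the single valid lane left in `a` with the first 31 lanes of `b`, and the third holds the last lane of `b` followed by padding.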

Since the approach is fairly generic, a geomean speedup of ~30% is observed across multiple platforms. The peak speedup for certain problem layers reaches ~70%.
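
For reference, a geomean speedup is the geometric mean of the per-layer ratios of baseline time to new time. The Python sketch below shows the calculation with hypothetical timings, not data from this PR; reading "~30%" as a ratio of about 1.3 is an assumption.

```python
import math


def geomean_speedup(baseline_times, new_times):
    """Geometric mean of per-layer speedups (baseline time / new time)."""
    ratios = [b / n for b, n in zip(baseline_times, new_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-layer timings in milliseconds, for illustration only.
baseline = [1.3, 2.6, 0.65]
improved = [1.0, 2.0, 0.5]
speedup = geomean_speedup(baseline, improved)  # about 1.3, i.e. roughly 30% faster
```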

The kernel has been kept within the reusable_simple_concat implementation. This may need to be revisited to determine the optimal code organization.

Checklist

General

Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
Have you formatted the code using clang-format?

Performance improvements

Have you submitted performance data that demonstrates performance improvements?

syurkevi · 2025-03-14T01:58:29Z

make test perf-gpu
set primitive=concat

dzarukin · 2025-03-14T02:23:21Z

tests/benchdnn/inputs/concat/test_concat_ip

+
+--stag=ABcd32a32b:ABcd32a32b
+--dtag=ABcd32a32b
+4x33x8x8:4x32x8x8


That's too much to validate a pretty narrow new use case.
Please reduce it and try to incorporate it into existing files.
Besides that, this file won't be picked up for GPU validation.

Additionally, it would be nice to understand where the blocked layout comes from because, AFAIK, activations must move to the nxc format for the whole model since Xe2.

Reduced the number of tests and incorporated them into GPU validation. Do let me know if there are still too many.

There has been some back and forth on the necessity of blocked-layout support. There were talks that it will still be needed on future platforms, but I don't recall the details of which customer it was for. I would like to close this narrow-case performance gap regardless and will revisit with more improvements (e.g. 3+ inputs, large block sizes with small SIMD, flexible conf) if there is demand again.

syurkevi · 2025-03-14T17:24:33Z

make test
disable test_device_cpu
disable benchdnn_all
enable benchdnn_concat

atkassen · 2025-03-14T17:41:56Z

make test perf-gpu
set primitive=concat

src/gpu/intel/ocl/reusable_simple_concat.hpp

syurkevi · 2025-03-15T01:09:17Z

make test perf-gpu
set primitive=concat

syurkevi · 2025-03-20T01:11:59Z

make test perf-gpu
set primitive=concat

syurkevi · 2025-03-20T02:04:32Z

make test perf-gpu
set primitive=concat

syurkevi requested review from a team as code owners March 14, 2025 01:56

github-actions bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:tests Codeowner: @oneapi-src/onednn-arch labels Mar 14, 2025

syurkevi force-pushed the syurkevi/concat_inner_padding branch from f76cfd3 to 22293cc Compare March 14, 2025 02:10

dzarukin reviewed Mar 14, 2025

syurkevi force-pushed the syurkevi/concat_inner_padding branch 2 times, most recently from 7976502 to 8f69bfd Compare March 14, 2025 03:13

atkassen reviewed Mar 14, 2025

src/gpu/intel/ocl/reusable_simple_concat.hpp (outdated, resolved)

syurkevi force-pushed the syurkevi/concat_inner_padding branch from fdde2b5 to 679791a Compare March 20, 2025 01:01

syurkevi force-pushed the syurkevi/concat_inner_padding branch from 679791a to 9bc2d32 Compare March 20, 2025 01:53

syurkevi added 7 commits March 21, 2025 12:57

gpu: ocl: concat: add minimal internal padding kernel

8584266

xe: concat: add logic for 2-input specialization

6591ccd

xe: concat: small inner axis correction before refactor

1e7acaf

xe: concat: separate inner-padding and reusable conf

9a37df5

xe: concat: add tests for internal padding

714c082

xe: concat: fix warnings and leading boundary shift

83194b5

xe: concat: reduce register pressure with smaller compute_t

c6f494a

syurkevi force-pushed the syurkevi/concat_inner_padding branch from 9bc2d32 to c6f494a Compare March 21, 2025 20:57
