concat2 internal padding #2882
--stag=ABcd32a32b:ABcd32a32b
--dtag=ABcd32a32b
4x33x8x8:4x32x8x8
That's too much to validate a pretty narrow new use case.
Please reduce and try to incorporate into existing files.
Besides that, this file won't be picked up for GPU validation.
Additionally, it would be nice to understand where the blocked layout comes from because, AFAIK, activations must move to nxc format for the whole model since Xe2.
Reduced the number of tests and incorporated into gpu validation. Do let me know if there are still too many.
There has been some back and forth on the necessity of block support. There were talks that it will still be needed on future platforms, but I don't recall the details or which customer it was for. I would like to close this narrow-case performance gap regardless, and will revisit with more improvements (e.g. 3+ inputs, large block sizes with small SIMD, flexible conf, etc.) if there is demand again.
Description
This PR improves performance for concat with blocked internal padding, in cases such as
aBc32b:aBc32b 2x33x2:2x32x2
that contain padding inside the concat dimension that must be removed. Better bandwidth utilization is achieved through a new specialized kernel that attempts to form fully utilized write cache lines from overlapping aligned reads. Currently, the kernel is specialized to n=2 inputs only and has some restrictions with respect to SIMD size. The change should address a large portion of the problem layers identified in issue MFDNN-9174.
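To make the padding problem concrete, here is a minimal scalar sketch in Python of what the concat has to compute for the `aBc32b:aBc32b 2x33x2:2x32x2` case. It is not the GPU kernel from this PR and not oneDNN API: the helper names (`blocked_offset`, `to_blocked`, `concat2_blocked`) are hypothetical, and the physical layout is assumed to be `[N][ceil(C/32)][W][32]` with zero padding.

```python
BLOCK = 32  # inner block size of the aBc32b layout


def blocked_offset(n, c, w, C, W, block=BLOCK):
    """Physical offset of logical element (n, c, w) in an aBc{block}b buffer."""
    c_blocks = -(-C // block)  # ceil(C / block): the channel dim is padded up
    return ((n * c_blocks + c // block) * W + w) * block + c % block


def to_blocked(dense, N, C, W, block=BLOCK):
    """Pack a dense [N][C][W] nested list into a zero-padded blocked buffer."""
    c_blocks = -(-C // block)
    buf = [0] * (N * c_blocks * W * block)
    for n in range(N):
        for c in range(C):
            for w in range(W):
                buf[blocked_offset(n, c, w, C, W, block)] = dense[n][c][w]
    return buf


def concat2_blocked(src0, C0, src1, C1, N, W, block=BLOCK):
    """Reference concat of two blocked buffers along the channel dim."""
    C = C0 + C1
    c_blocks = -(-C // block)
    dst = [0] * (N * c_blocks * W * block)
    for n in range(N):
        for c in range(C):
            # Select the source by logical channel, so the padding at the
            # end of src0's last block is never copied into dst.
            if c < C0:
                src, c_src, C_src = src0, c, C0
            else:
                src, c_src, C_src = src1, c - C0, C1
            for w in range(W):
                dst[blocked_offset(n, c, w, C, W, block)] = \
                    src[blocked_offset(n, c_src, w, C_src, W, block)]
    return dst


# Usage for 2x33x2 : 2x32x2 (values are arbitrary non-zero markers).
N, W = 2, 2
d0 = [[[1 + (n * 33 + c) * W + w for w in range(W)] for c in range(33)] for n in range(N)]
d1 = [[[1000 + (n * 32 + c) * W + w for w in range(W)] for c in range(32)] for n in range(N)]
src0 = to_blocked(d0, N, 33, W)   # 2 channel blocks, 31 padded lanes in the last
src1 = to_blocked(d1, N, 32, W)   # 1 full channel block
dst = concat2_blocked(src0, 33, src1, 32, N, W)  # 65 channels -> 3 blocks
```

With 33 channels in the first input, every destination block after the first one is assembled from lanes of two different source blocks (for example, destination channels 32..63 take channel 32 of the first input and channels 0..30 of the second). That misalignment is why a plain block-by-block copy is not enough here.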
Since the approach is fairly generic, a geomean speedup of roughly 30% is observed across multiple platforms. Peak speedup on certain problem layers reaches about 70%.
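For reference, a geomean speedup can be computed as sketched below. This assumes speedup is defined per layer as the ratio of baseline time to new time, which the description does not state explicitly; the function name and the timings in the usage line are hypothetical placeholders, not measurements from this PR.

```python
import math


def geomean_speedup(baseline_times, new_times):
    """Geometric mean of per-layer speedups (baseline time / new time)."""
    ratios = [b / n for b, n in zip(baseline_times, new_times)]
    # Average in log space, then map back; equivalent to the n-th root
    # of the product of the ratios.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical timings only: two layers with 2x and 4x speedups.
example = geomean_speedup([2.0, 8.0], [1.0, 2.0])
```

Under this definition a result of 1.3 corresponds to a 30% speedup.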
The kernel has been kept within the reusable_simple_concat implementation. This may need to be revisited to determine the optimal code organization.
Checklist
General
Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
Performance improvements