-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xe: add error message when global work sizes overflow #2767
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
/******************************************************************************* | ||
* Copyright 2025 Intel Corporation | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||
* you may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*******************************************************************************/ | ||
|
||
#include "gpu/intel/compute/utils.hpp" | ||
#include "common/verbose.hpp" | ||
|
||
#include <limits> | ||
|
||
namespace dnnl { | ||
namespace impl { | ||
namespace gpu { | ||
namespace intel { | ||
namespace compute { | ||
|
||
void check_global_range(const compute::range_t &range) { | ||
bool exceeds_32bit = false; | ||
const size_t u32_max = std::numeric_limits<uint32_t>::max(); | ||
for (size_t i = 0; i < range.ndims(); i++) { | ||
if (range[i] > u32_max) { | ||
exceeds_32bit = true; | ||
break; | ||
} | ||
} | ||
if (exceeds_32bit) { | ||
VERROR(common, runtime, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If exceeding 32 bits is fatal then this option is totally appropriate. Are there any scenarios where this works? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Short answer: Yes, there are scenarios where this works. The driver limitation only applies to kernels generated by the OpenCL Compiler, so nGEN based implementations are not affected. There is a slightly different condition that will always fail, but is not what this PR is targeting. Long Answers: Our GPU hardware uses a 32-bit counter for specifying a kernels workgroup-id. If the number of work groups (
This results in incorrect results due to the resulting There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the explanation, Roy! Sounds like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @vpirogov With that I have a few options in mind:
I'm learning towards 1 (it just doesn't look right to me to being able to detect an invalid scenario but still continue producing incorrect results) but 2 also works for me to at least help with diagnostics. @vpirogov @rjoursler Thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think we should enable
We can do 1, but this needs to be modified to only error for kernels generated by the OpenCL compiler. Also, a fix for this issue is currently being tested in the driver (with performance regressions being the main concern), so I expect there is minimal benefit from adding workaround logic within oneDNN for this particular issue. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Either 1 or 2 looks reasonable to me. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Thanks for the heads-up. Then let's hold off this PR and wait for the driver upgrade if it's expected soon. As something to improve in the future - I think we need a softer check for tests like this. If something is known to be broken - let's not make it reported as failing in our QA (e.g. we can ask the Infra team to report failures in some steps as warnings). People get used to this - when something is always failing, and new failures get harder to catch in pre-commit testing. |
||
"global work size exceeds the 32-bit limit. Potential " | ||
"correctness issues may arise due to driver limitation"); | ||
} | ||
} | ||
|
||
} // namespace compute | ||
} // namespace intel | ||
} // namespace gpu | ||
} // namespace impl | ||
} // namespace dnnl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random Spot:
When we perform testing on release builds, its difficult to catch non-critical errors where oneDNN is required to be functional, but oneDNN or its dependencies are not working as intended. Examples of these soft errors are things like: OpenCL compiler warnings, not using ZeBin in nGEN, excessive regeneration of conv:ir kernels, etc. Because of this, we need an out-of-band channel to signal non-critical errors in testing. I have had some discussions (such as with @dzarukin) on what could be used for this. The conclusion that I reached is that using the verbose warnings for this out-of-band channel makes the most sense as it is effectively a generalization of our current practices of using
-Werror
at compliation.