xe: add error message when global work sizes overflow #2767
base: main
Conversation
}
}
if (exceeds_32bit) {
    VERROR(common, runtime,
If exceeding 32 bits is fatal then this option is totally appropriate. Are there any scenarios where this works?
Short answer: Yes, there are scenarios where this works. The driver limitation only applies to kernels generated by the OpenCL compiler, so nGEN-based implementations are not affected. There is a slightly different condition that will always fail, but it is not what this PR is targeting.
Long answer: Our GPU hardware uses a 32-bit counter for specifying a kernel's workgroup id. If the number of work groups (= global_range[i] / local_range[i]) exceeds 2^32, computation will fail due to this counter overflowing. This kind of failure will be caught by the runtime, which returns an out-of-resources error. As a consequence, almost all of our kernel implementations will fail past some problem size (and reference kernels are the most affected by this limitation due to the lack of blocking). On top of this, the OpenCL compiler has a bug, as it calculates a kernel's global_id as:
`uint32_t global_id = workgroup_id * work_group_size + local_id`
This produces incorrect results due to the uint32_t overflow and violates the OpenCL specification, as global_id should have type size_t.
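For illustration only, a minimal standalone sketch (made-up sizes, not oneDNN code) of how the 32-bit global_id computation described above wraps around while the size_t version stays correct:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical 1-D dispatch: 2^33 work items with a work-group size of
    // 256 -> 2^25 work-groups, so the 32-bit workgroup-id counter is fine,
    // but a 32-bit global_id cannot address work items past 2^32 - 1.
    const uint64_t global_range = uint64_t(1) << 33;
    const uint32_t work_group_size = 256;

    // A work item in the last work-group of the range.
    const uint32_t workgroup_id
            = uint32_t(global_range / work_group_size) - 1;
    const uint32_t local_id = 0;

    // Buggy computation described above: the product wraps modulo 2^32.
    uint32_t global_id_32 = workgroup_id * work_group_size + local_id;

    // Spec-conforming computation with size_t (64-bit here).
    size_t global_id_ok = size_t(workgroup_id) * work_group_size + local_id;

    printf("32-bit global_id: %u\n", global_id_32); // 4294967040 (wrapped)
    printf("size_t global_id: %zu\n", global_id_ok); // 8589934336
    return 0;
}
```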
Thanks for the explanation, Roy! Sounds like VWARN is a more appropriate option here.
@vpirogov VWARN does not help with diagnostics, as warnings are not enabled by default in ONEDNN_VERBOSE. Another thing is that we seem to stick to VERROR for fatal (non-recoverable) errors and reserve VWARN for issues that do not affect functional behavior (e.g., falling back to a slower implementation while the result is still correct).
With that I have a few options in mind:

1. (most radical) Keep VERROR() and report runtime_error when encountering large sizes
   - Why: there is no sense in computing something that is incorrect; just bail out and report an error
   - Counter-argument: what if the driver fixes that at some point but oneDNN still reports an error? True, but without this fix we can't complete full validation and fix oneDNN-specific issues, hence oneDNN is (likely) broken anyway until we run validation with a fixed driver
2. Switch to VWARN() and enable warnings in QA (ONEDNN_VERBOSE=warn) to help with diagnostics
   - Less invasive, but still makes troubleshooting of QA failures easier
3. Keep it as is: no warnings, no errors, still failures in QA
   - Large sizes were generally broken before and still have issues now; let's just wait until it's fixed in the driver
4. Implement a workaround in oneDNN to address the driver limitation
   - This is an option, but it seems high-effort and low-value: large sizes support is not that urgent or critical

I'm leaning towards 1 (it just doesn't look right to me to be able to detect an invalid scenario but still continue producing incorrect results), but 2 also works for me to at least help with diagnostics.
@vpirogov @rjoursler Thoughts?
> Switch to VWARN() and enable warnings in QA (ONEDNN_VERBOSE=warn) to help with diagnostics
> - Less invasive, but still makes troubleshooting of QA failures easier

I think we should enable warn in CI either way, as this also provides a mechanism to catch OpenCL compiler warnings. I actually already submitted a ticket to do this about a month ago, so we could see about prioritizing this change so it is available sooner.
> I'm leaning towards 1 (it just doesn't look right to me to be able to detect an invalid scenario but still continue producing incorrect results)
We can do 1, but this needs to be modified to only error for kernels generated by the OpenCL compiler. Also, a fix for this issue is currently being tested in the driver (with performance regressions being the main concern), so I expect there is minimal benefit from adding workaround logic within oneDNN for this particular issue.
Either 1 or 2 looks reasonable to me.
> Also, a fix for this issue is currently being tested in the driver (with performance regressions being the main concern), so I expect there is minimal benefit from adding workaround logic within oneDNN for this particular issue

Thanks for the heads-up. Then let's hold off on this PR and wait for the driver upgrade if it's expected soon.

As something to improve in the future: I think we need a softer check for tests like this. If something is known to be broken, let's not report it as failing in our QA (e.g., we can ask the Infra team to report failures in some steps as warnings). People get used to it when something is always failing, and new failures get harder to catch in pre-commit testing.
#include "gpu/intel/compute/utils.hpp"
#include "common/verbose.hpp"
Random Spot:
> It would be nice to have a mechanism to "highlight" known failures to save engineer time on triaging.
When we perform testing on release builds, it's difficult to catch non-critical errors where oneDNN is required to be functional, but oneDNN or its dependencies are not working as intended. Examples of these soft errors are things like OpenCL compiler warnings, not using ZeBin in nGEN, excessive regeneration of conv:ir kernels, etc. Because of this, we need an out-of-band channel to signal non-critical errors in testing. I have had some discussions (such as with @dzarukin) on what could be used for this. The conclusion I reached is that using the verbose warnings for this out-of-band channel makes the most sense, as it is effectively a generalization of our current practice of using -Werror at compilation.
This is to improve diagnostics for failing large-buffer cases. This is a known issue causing failures in Nightly testing. The PR adds an error message to make it easier to tell apart oneDNN-related failures from driver-related ones.

I'm not sure this is the best option, so please share your ideas as well. It would be nice to have a mechanism to "highlight" known failures to save engineer time on triaging.
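For context, a rough standalone sketch of the kind of check a change like this might perform (the helper name, per-dimension condition, and error text below are assumptions for illustration; the actual patch reports through VERROR(common, runtime, ...) as shown in the diff above):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for a dispatch-time check: flag dispatches whose
// global work size in some dimension cannot be handled by a 32-bit
// global_id computation in the OpenCL compiler.
bool global_size_exceeds_32bit(const std::vector<uint64_t> &global_range) {
    for (uint64_t d : global_range) {
        if (d > (uint64_t(1) << 32)) return true;
    }
    return false;
}

int main() {
    // Made-up dispatch sizes that trip the check.
    std::vector<uint64_t> global = {uint64_t(1) << 33, 1, 1};
    if (global_size_exceeds_32bit(global))
        fprintf(stderr,
                "error: global work size exceeds 32 bits; kernels built by "
                "the OpenCL compiler may produce incorrect results\n");
    return 0;
}
```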