xe: add error message when global work sizes overflow #2767

Open · wants to merge 1 commit into main
Conversation

echeresh (Contributor)

This improves diagnostics for the failing large-buffer cases, a known issue causing failures in Nightly testing. The PR adds an error message to make it easier to distinguish oneDNN-related failures from driver-related ones.

I'm not sure this is the best option, so please share your ideas as well. It would be nice to have a mechanism to "highlight" known failures to save engineer time on triaging.

@echeresh echeresh added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Feb 27, 2025
@echeresh echeresh requested a review from a team as a code owner February 27, 2025 01:09
}
}
if (exceeds_32bit) {
VERROR(common, runtime,
Contributor

If exceeding 32 bits is fatal then this option is totally appropriate. Are there any scenarios where this works?

rjoursler (Contributor) · Feb 27, 2025

Short answer: Yes, there are scenarios where this works. The driver limitation only applies to kernels generated by the OpenCL compiler, so nGEN-based implementations are not affected. There is a slightly different condition that will always fail, but that is not what this PR is targeting.

Long answer: Our GPU hardware uses a 32-bit counter for specifying a kernel's workgroup id. If the number of work groups (= global_range[i] / local_range[i]) exceeds 2^32, computation will fail due to this counter overflowing. This kind of failure will be caught by the runtime, which returns an out-of-resources error. As a consequence, almost all of our kernel implementations will fail past some problem size (and reference kernels are the most affected by this limitation due to their lack of blocking). On top of this, the OpenCL compiler has a bug, as it calculates a kernel's global_id as:

uint32_t global_id = workgroup_id * work_group_size + local_id

This produces incorrect results due to the uint32_t overflow and violates the OpenCL specification, which requires global_id to have type size_t.

Contributor

Thanks for the explanation, Roy! Sounds like VWARN is a more appropriate option here.

echeresh (Contributor, Author) · Feb 27, 2025

@vpirogov VWARN does not help with diagnostics, as warnings are not enabled by default in ONEDNN_VERBOSE. Another thing is that we seem to stick to VERROR for fatal (non-recoverable) errors and reserve VWARN for issues that do not affect functional behavior (e.g. falling back to a slower implementation where the result is still correct).

With that I have a few options in mind:

  1. (most radical) Keep VERROR() and report runtime_error when encountering large sizes
    • Why: there is no sense to compute something that is incorrect, just bail out and report an error
    • Counter-argument: what if the driver fixes that at some point but oneDNN still reports an error? True, but without this fix we can't complete full validation and fix oneDNN-specific issues hence oneDNN is (likely) broken anyway until we run validation with a fixed driver
  2. Switch to VWARN() and enable warnings in QA ONEDNN_VERBOSE=warn to help with diagnostics
    • Less invasive but makes troubleshooting of QA failures easier
  3. Keep it as is, no warnings, no errors, still failures in QA
    • Large sizes were generally broken before, still having issues now, let's just wait until it's fixed in the driver
  4. Implement a workaround in oneDNN to address the driver limitation
    • This is an option but seems high-effort and low-value: large-size support is not that urgent or critical

I'm leaning towards 1 (it just doesn't look right to me to be able to detect an invalid scenario but still continue producing incorrect results), but 2 also works for me to at least help with diagnostics.

@vpirogov @rjoursler Thoughts?

rjoursler (Contributor) · Feb 27, 2025


> Switch to VWARN() and enable warnings in QA ONEDNN_VERBOSE=warn to help with diagnostics
> Less invasive but makes troubleshooting of QA failures easier

I think we should enable warn in CI either way, as it also provides a mechanism to catch OpenCL compiler warnings. I actually submitted a ticket to do this about a month ago, so we could look into prioritizing that change so it is available sooner.

> I'm leaning towards 1 (it just doesn't look right to me to be able to detect an invalid scenario but still continue producing incorrect results)

We can do 1, but this needs to be modified to only error for kernels generated by the OpenCL compiler. Also, a fix for this issue is currently being tested in the driver (with performance regressions being the main concern), so I expect there is minimal benefit from adding workaround logic within oneDNN for this particular issue.

Contributor


Either 1 or 2 looks reasonable to me.

echeresh (Contributor, Author)


> Also, a fix for this issue is currently being tested in the driver (with performance regressions being the main concern), so I expect there is minimal benefit from adding workaround logic within oneDNN for this particular issue

Thanks for the heads-up. Then let's hold off on this PR and wait for the driver upgrade if it's expected soon.

As something to improve in the future: I think we need a softer check for cases like this. If something is known to be broken, let's not have it reported as a failure in our QA (e.g. we can ask the Infra team to report failures in some steps as warnings). People get used to persistent failures, and new failures become harder to catch in pre-commit testing.


#include "gpu/intel/compute/utils.hpp"
#include "common/verbose.hpp"

rjoursler (Contributor) · Feb 27, 2025


Random Spot:

> It would be nice to have a mechanism to "highlight" known failures to save engineer time on triaging.

When we perform testing on release builds, it's difficult to catch non-critical errors where oneDNN is required to be functional, but oneDNN or its dependencies are not working as intended. Examples of these soft errors include OpenCL compiler warnings, not using ZeBin in nGEN, excessive regeneration of conv:ir kernels, etc. Because of this, we need an out-of-band channel to signal non-critical errors in testing. I have had some discussions (such as with @dzarukin) on what could be used for this. The conclusion I reached is that using verbose warnings for this out-of-band channel makes the most sense, as it is effectively a generalization of our current practice of using -Werror at compilation.


Labels: platform:gpu-intel (Codeowner: @oneapi-src/onednn-gpu-intel)
3 participants