Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xe: add error message when global work sizes overflow #2767

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions src/gpu/intel/compute/utils.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
/*******************************************************************************
* Copyright 2025 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/

#include "gpu/intel/compute/utils.hpp"
#include "common/verbose.hpp"

Copy link
Contributor

@rjoursler rjoursler Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random Spot:

It would be nice to have a mechanism to "highlight" known failures to save engineer time on triaging.

When we perform testing on release builds, its difficult to catch non-critical errors where oneDNN is required to be functional, but oneDNN or its dependencies are not working as intended. Examples of these soft errors are things like: OpenCL compiler warnings, not using ZeBin in nGEN, excessive regeneration of conv:ir kernels, etc. Because of this, we need an out-of-band channel to signal non-critical errors in testing. I have had some discussions (such as with @dzarukin) on what could be used for this. The conclusion that I reached is that using the verbose warnings for this out-of-band channel makes the most sense as it is effectively a generalization of our current practices of using -Werror at compliation.

#include <limits>

namespace dnnl {
namespace impl {
namespace gpu {
namespace intel {
namespace compute {

void check_global_range(const compute::range_t &range) {
bool exceeds_32bit = false;
const size_t u32_max = std::numeric_limits<uint32_t>::max();
for (size_t i = 0; i < range.ndims(); i++) {
if (range[i] > u32_max) {
exceeds_32bit = true;
break;
}
}
if (exceeds_32bit) {
VERROR(common, runtime,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If exceeding 32 bits is fatal then this option is totally appropriate. Are there any scenarios where this works?

Copy link
Contributor

@rjoursler rjoursler Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer: Yes, there are scenarios where this works. The driver limitation only applies to kernels generated by the OpenCL Compiler, so nGEN based implementations are not affected. There is a slightly different condition that will always fail, but is not what this PR is targeting.

Long Answers: Our GPU hardware uses a 32-bit counter for specifying a kernels workgroup-id. If the number of work groups (= global_range[i] / local_range[i]) exceeds 2^32 , computation will fail due to this counter overflowing. This kind of failure will be caught by the runtime which returns an out of resources error. As a consequence, almost all of our kernel implementations will fail past some problem size (and reference kernels are the most effected by this limitation due to the lack of blocking). On top of this, the OpenCL compiler has a bug as it calculates a kernels global_id as:

uint32_t global_id = workgroup_id * work_group_size + local_id

This results in incorrect results due to the resulting uint32_t overflow and violates the OpenCL specification as global_id should have type size_t.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, Roy! Sounds like VWARN is more appropriate option here.

Copy link
Contributor Author

@echeresh echeresh Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vpirogov VWARN does not help with diagnostics as warnings are not enabled by default in ONEDNN_VERBOSE. Another thing is that we seems to stick to VERROR for fatal (non-recoverable) errors and reserve VWARN that do not affect functional behavior (e.g. falling back to a slower implementation but the result is still correct).

With that I have a few options in mind:

  1. (most radical) Keep VERROR() and report runtime_error when encountering large sizes
    • Why: there is no sense to compute something that is incorrect, just bail out and report an error
    • Counter-argument: what if the driver fixes that at some point but oneDNN still reports an error? True, but without this fix we can't complete full validation and fix oneDNN-specific issues hence oneDNN is (likely) broken anyway until we run validation with a fixed driver
  2. Switch to VWARN() and enable warnings in QA ONEDNN_VERBOSE=warn to help with diagnostics
    • Less invasive but makes troubleshooting of QA failures easier
  3. Keep it as is, no warnings, no errors, still failures in QA
    • Large sizes were generally broken before, still having issues now, let's just wait until it's fixed in the driver
  4. Implement a workaround in oneDNN to address the driver limitation
    • This is an option but seems to be high-effort and low-value - large sizes support is not that urgent and critical

I'm learning towards 1 (it just doesn't look right to me to being able to detect an invalid scenario but still continue producing incorrect results) but 2 also works for me to at least help with diagnostics.

@vpirogov @rjoursler Thoughts?

Copy link
Contributor

@rjoursler rjoursler Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to VWARN() and enable warnings in QA ONEDNN_VERBOSE=warn to help with diagnostics
Less invasive but makes troubleshooting of QA failures easier

I think we should enable warn in CI either way as this also provides a mechanism to catch OpenCL compiler warnings. I actually already submitted a ticket to do this about a month ago, so we could see about prioritizing this change so it is available sooner.

I'm learning towards 1 (it just doesn't look right to me to being able to detect an invalid scenario but still continue producing incorrect results)

We can do 1, but this needs to be modified to only error for kernels generated by the OpenCL compiler. Also, a fix for this issue is currently being tested in the driver (with performance regressions being the main concern), so I expect there is minimal benefit from adding workaround logic within oneDNN for this particular issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either 1 or 2 looks reasonable to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, a fix for this issue is currently being tested in the driver (with performance regressions being the main concern), so I expect there is minimal benefit from adding workaround logic within oneDNN for this particular issue

Thanks for the heads-up. Then let's hold off this PR and wait for the driver upgrade if it's expected soon.

As something to improve in the future - I think we need a softer check for tests like this. If something is known to be broken - let's not make it reported as failing in our QA (e.g. we can ask the Infra team to report failures in some steps as warnings). People get used to this - when something is always failing, and new failures get harder to catch in pre-commit testing.

"global work size exceeds the 32-bit limit. Potential "
"correctness issues may arise due to driver limitation");
}
}

} // namespace compute
} // namespace intel
} // namespace gpu
} // namespace impl
} // namespace dnnl
4 changes: 3 additions & 1 deletion src/gpu/intel/compute/utils.hpp
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*******************************************************************************
* Copyright 2019-2024 Intel Corporation
* Copyright 2019-2025 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -155,6 +155,8 @@ class nd_range_t {
range_t local_range_;
};

void check_global_range(const compute::range_t &range);

} // namespace compute
} // namespace intel
} // namespace gpu
Expand Down
1 change: 1 addition & 0 deletions src/gpu/intel/ocl/kernel.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -228,6 +228,7 @@ status_t kernel_t::parallel_for(impl::stream_t &stream,
cl_uint ndims = static_cast<cl_uint>(range.ndims());
if (range.is_zero()) { return status::success; }

check_global_range(range.global_range());
xpu::ocl::wrapper_t<cl_event> event;
if (ocl_stream->flags() & stream_flags::out_of_order) {
const auto &event_wrappers = xpu::ocl::event_t::from(deps).events;
Expand Down
1 change: 1 addition & 0 deletions src/gpu/intel/sycl/sycl_interop_gpu_kernel.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -160,6 +160,7 @@ status_t sycl_interop_gpu_kernel_t::parallel_for(impl::stream_t &stream,
set_scalar_arg(cgh, (int)i, arg.scalar_type(), arg.value());
}
}
check_global_range(range.global_range());
if (range.local_range()) {
auto sycl_nd_range = gpu::intel::sycl::to_sycl_nd_range(range);
cgh.parallel_for(sycl_nd_range, *sycl_kernel_);
Expand Down