-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generic: sycl: Introduce spec constants for sycl matmul #2250
base: main
Are you sure you want to change the base?
generic: sycl: Introduce spec constants for sycl matmul #2250
Conversation
#ifdef DNNL_SYCL_GENERIC | ||
// skip unimplemented configs for sycl impl | ||
|| is_gpu() | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed the long term goal. As the generic backend is still a work in progress and new features are being added to oneDNN it is useful to be able to skip some configurations that are not supported yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct, this has been removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it a bit prematurely, its added back in until a conclusion is reached
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any unimplemented hits in current validation. It makes me think if there are any new unimplemented cases, they must be local to these changes. I can provide more guidance if there's a pointer on or a snapshot of what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this change running the oneDNN compiled with generic vendor (sycl ref matmul used) causes benchdnn to fail with test unimplemented error when testing with runtime dims, spec constants require kernel to be compiled for the specific spec constants (specific sizes for src/dst/weights) so current approach does now support runtime dims.
We are wondering if this can should be reported as skipped by benchdnn, as it is expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds to me like a matmul-specific skip.
I propose to introduce is_generic_gpu()
method identically to is_nvidia_gpu()
, and put a check here (before the if() {} section).
The change in this PR will silence ALL drivers with ANY unimplemented hit which is not the desired behavior in a long term.
The practice to mark unimplemented as skipped is to figure out the minimal scope all of them fit and skip only for it. In your case is sounds like any runtime dimensions involved. You may relate to this statement when crafting it
src/xpu/sycl/types.hpp
Outdated
@@ -146,6 +177,23 @@ struct md_t { | |||
#undef CHECK_AND_ASSIGN | |||
} | |||
|
|||
// This constructor is to be called inside the kernel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you control this copy constructor will be called only from the inside the kernel and not from other places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed this comment as this seems to be more misleading than helpful, the idea of the new constructor is to allow the spec constant struct to be convertible to the sycl version of the memory descriptor md_t
, not to discourage use outside of the kernel, which would be fine.
src/xpu/sycl/types.hpp
Outdated
@@ -146,6 +177,23 @@ struct md_t { | |||
#undef CHECK_AND_ASSIGN | |||
} | |||
|
|||
// This constructor is to be called inside the kernel | |||
md_t(md_t_spec_const &mimicker) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
md_t(md_t_spec_const &mimicker) { | |
md_t(const md_t_spec_const &mimicker) : data_type_(...), ... { | |
for () { | |
... | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the suggested change
dim32_t inner_nblks_; | ||
dims32_t inner_blks_; | ||
dims32_t inner_idxs_; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This struct cannot have a non trivial constructor, or any non trivial types, because...
// why exactly?
static_assert(std::is_trivial_v<md_t_spec_const>, "md_t_spec_const must be trivial");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking into this a bit more DPC++ defines SYCL_DEVICE_COPYABLE=1
so we may be able to avoid the need of md_t_spec_const
if we specialize is_device_copyable_v<md_t>
to return true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the knowledge not available from reading the code better be put in a comment as a reference with exact spot to look or as a summary.
55e83bd
to
ac9c4aa
Compare
src/gpu/generic/sycl/ref_matmul.cpp
Outdated
@@ -52,16 +55,56 @@ void ref_matmul_t::pd_t::init_conf() { | |||
for (const auto &mdw : {src_d, weights_d, dst_d, bias_d}) { | |||
if (mdw.has_runtime_dims()) { | |||
any_runtime_params_ = true; | |||
return; | |||
return status::unimplemented; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) please don't hesitate to use verbose macros for this kind of checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion, made the change and prefer it this way as well, thank you
src/gpu/generic/sycl/ref_matmul.cpp
Outdated
memory_desc_wrapper mdw(md); | ||
|
||
assert(mdw.format_kind() == format_kind::blocked); | ||
assert(mdw.ndims() <= max_dims); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that this function can fail, I would advocate to return a status instead of using asserts, and propagate by it in init_conf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asserts changed to verbose checks, propagating status_t now as well.
parallel_for(ctx, kernel_, [&](::sycl::handler &cgh) { | ||
matmul_kernel_fwd_t matmul_kernel(conf, cgh, ctx); | ||
matmul_kernel_fwd_t matmul_kernel(pd()->conf_, cgh, ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, using specialization constants will trigger recompilation during execute function.
TBH, I am a bit bothered by the precedent this would set, as multiple users rely on the property that after primitive creation, there should be no recurrent jitting overheads. Are we sure there is no other venue to avoid the argument size limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow, we discussed that the spec constant would only trigger recompilation if we supported runtime dimensions. We discussed it does not make sense to support runtime dimensions so in the current patch the kernel compilation should only happen in the init
function.
From what I gathered there are no great solutions in the short term:
- Reducing the amount of arguments given to the kernel. We already use a struct as an argument to only give the arguments that may be used in the SYCL kernels.
- Reducing the number of post-op supported: we already support "only" 5 post-ops. I wouldn't mind reducing this further but this may disable more tests. My understanding was that we didn't want to reduce this further.
The long term solution is to use SYCL-RTC but this will be a large task. The spec constants are useful to be able to run more tests with the SYCL backend until SYCL-RTC is implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow, we discussed that the spec constant would only trigger recompilation if we supported runtime dimensions. We discussed it does not make sense to support runtime dimensions
I guess I missed that part. Given that this is a reference implementation, I would expect that at some point it would support runtime dimensions. Solving the current argument size limitation issue with specialization_constant would defer runtime dimension support until SYCL-RTC is enabled. Is that something we are fine with?
Reducing the amount of arguments given to the kernel. We already use a struct as an argument to only give the arguments that may be used in the SYCL kernels.
IIUC, sycl_matmul_conf_t is used to pass argument shapes to the kernel right?
If so, it seems to use sycl_post_ops_t which is a union type, for which the biggest members are prelu_post_op with ~100 Bytes and binary_post_op with ~160 Bytes. So 5 post_ops gets us to ~800B, do I get that right?
(which is about 1/4 of argument size limit).
In any case, here are a few suggestions:
- for prelu_post_op, using dim_t[xpu::sycl::md_t::max_dims] instead of dims_t for the strides array would half that size.
- for binary_post_op, we could use a couple of arrays for dims and strides instead of using md_t?
- In sycl::md_t, we have about 50B related to blocking (inner_block, here). Do we actually use blocked layouts in generic sycl kernels? If not that would shape about 250B from post_ops, and ~500B from matmul_conf.
- separate post-ops that require extra md from the others, so that we can reduce the number of post-ops that require passing md to kernel (e.g. binary/prelu), while not restricting others (e.g. eltwise).
Reducing the number of post-op supported: we already support "only" 5 post-ops. I wouldn't mind reducing this further but this may disable more tests. My understanding was that we didn't want to reduce this further.
Correct 5 is already low. However it seems we are using a union for all postops. Splitting eltwise from binary/prelu postop could help limit only the number of binary/prelu post-ops (e.g. to 2 binary/prelu post-ops), while not restricting the number of eltwise post-ops. That could be an option as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed suggestion. From previous investigations it did not seem possible to reduce enough the arguments size but we could investigate again. I'm not confident we can remove enough members from the binary post op.
Just one clarification, the argument size limit is 2kB so the 5 post-ops would represents about 40% of the total size. We need to shave off a total of 282B or at least 57B per post-op. I haven't looked into this myself but this seems difficult just for the binary post op which is apparently the biggest issue here.
Solving the current argument size limitation issue with specialization_constant would defer runtime dimension support until SYCL-RTC is enabled. Is that something we are fine with?
This sounds fine to me. From what I understand runtime dimensions are not that important if they are just meant for optimization purposes but the SYCL backend is not optimized yet. At this point it is easier to use spec-constant. This is just a workaround for us to be able to run tests and have a clearer picture of what operations are working as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mgouicem , Thank you for your detailed suggestions. The motivation for going ahead with spec-constants was to keep the changes as minimal as possible, which would be fairly simple to revert and do not involve considerable refactoring / changes for one specific kernel which lead to code duplication, or changing the "core" of oneDNN, i.e. the de-facto way of doing things in oneDNN. The motivation of avoiding this is because we expect all of these changes to be reverted once SYCL-RTC is in place.
In sycl::md_t, we have about 50B related to blocking (inner_block, here). Do we actually use blocked layouts in generic sycl kernels? If not that would shape about 250B from post_ops, and ~500B from matmul_conf.
I believe that the i8 matmuls which typically go to cublasLT, which expects data in a blocked formats, falls back to the generic kernel for the binary_post_op
. Hence the support of blocking over there.
separate post-ops that require extra md from the others, so that we can reduce the number of post-ops that require passing md to kernel (e.g. binary/prelu), while not restricting others (e.g. eltwise).
Since the post_ops_t struct is used by other reference primitives as well, this would mandate a change in them as well, and if something similar were to be developed just for the matmul kernel, it would lead to a lot of code duplication, hence it may not be viable.
for binary_post_op, we could use a couple of arrays for dims and strides instead of using md_t?
The kernel uses the off_v
and the off_v_masked
methods to calculate certain offsets, and those functions take care of padding if any etc etc. So that logic would need to be copied into another function if we were to isolate these two arrays into a struct of their own
If specialization constants is not preferable solution to this problem, we were also thinking of moving the kernel arguments to global memory. This would not trigger a re-compilation of the kernel when runtime-dimensions are used, and would have minimal effect on the kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the i8 matmuls which typically go to cublasLT, which expects data in a blocked formats, falls back to the generic kernel for the binary_post_op. Hence the support of blocking over there.
cublasLT blocking does not depend on blocking_desc structure (which contain inner_blk), but on dedicated cublasLT_blocking_desc, which is essentially an enum and does not require inner_blk array.
Since the post_ops_t struct is used by other reference primitives as well, this would mandate a change in them as well.
Not sure I see the required changes to primitives themselves. Here is my thinking: the post_ops_t structure is initialized using the primitive_attr directly (see here), so I don't expect a change of internal API here, but just a change in implementation of sycl_post_ops_t (e.g. instead of containing an array of union here, it would contain one array for lightweight eltwise_po, one small array for bigger binary_po/prelu, and maybe an extra array for indices to interleave these two).
make test |
@@ -409,16 +410,23 @@ struct matmul_kernel_fwd_t { | |||
CTX_IN_SYCL_KERNEL_MEMORY(DNNL_ARG_ATTR_DROPOUT_PROBABILITY)) | |||
, po_args_(cgh, ctx, conf_.post_ops) {} | |||
|
|||
void operator()(::sycl::nd_item<1> item) const { | |||
void operator()(::sycl::nd_item<1> item, ::sycl::kernel_handler kh) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void operator()(::sycl::nd_item<1> item, ::sycl::kernel_handler kh) const { | |
void operator()(::sycl::nd_item<1> item, const ::sycl::kernel_handler &kh) const { |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kernel_handler cannot be const because the get_specialization_constant()
is not marked caused and would throw an error.
auto md_t_spec_const_pod_val = kh.get_specialization_constant< | ||
detail::matmul::md_t_spec_const_id>(); | ||
auto data_md = md_t_spec_const_pod_val.data_md_t; | ||
auto weights_md = md_t_spec_const_pod_val.weights_md_t; | ||
auto dst_md = md_t_spec_const_pod_val.dst_md_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto &
to remove unnecessary copies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unneeded copies removed.
auto weights_md = md_t_spec_const_pod_val.weights_md_t; | ||
auto dst_md = md_t_spec_const_pod_val.dst_md_t; | ||
|
||
memory_tensor_t data_mem(data_, data_md); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: for matmul it's called src
, not data
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data
renamed to src
src/gpu/generic/sycl/ref_matmul.cpp
Outdated
init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d, | ||
weights_d, dst_d, bias_d); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d, | |
weights_d, dst_d, bias_d); | |
CHECK(init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d, | |
weights_d, dst_d, bias_d)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check was not needed as currently function used to assert, now changed to return status and added check
src/gpu/generic/sycl/ref_matmul.hpp
Outdated
auto init_status = init_conf(); | ||
|
||
return init_status; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto init_status = init_conf(); | |
return init_status; | |
return init_conf(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested change made.
src/xpu/sycl/types.hpp
Outdated
@@ -146,6 +177,21 @@ struct md_t { | |||
#undef CHECK_AND_ASSIGN | |||
} | |||
|
|||
md_t(const md_t_spec_const &mimicker) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
md_t(const md_t_spec_const &mimicker) | |
md_t(const md_t_spec_const &other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other
sounds better, change made.
Address comments Aaand its back! Addressed comments
ac9c4aa
to
fcaa982
Compare
Description
Currently the sycl implementation of matmul has a kernel argument size error when run on intel hardware, the size of the kernel arguments could not be trivially reduced, requiring a more complex solution. This PR introduces a fix for this by introducing spec constants for the source, weights and destination memory descriptors. This fix should be temporary while the best solution would be to leverage SYCL RTC to handle the argument size error, when available.
Additionally, due to the kernels being compiled for the spec constants we cannot have the values of source, weights and destination md change. This would mean that the sycl implementation cannot support runtime dimensions currently.
Checklist
General
make test
andmake test_benchdnn_*
) pass locally for each commit?