generic: sycl: Introduce spec constants for sycl matmul #2250

ShanoToni · 2024-12-11T17:11:11Z

Description

Currently the SYCL implementation of matmul hits a kernel argument size error when run on Intel hardware. The size of the kernel arguments could not be trivially reduced, so a more complex solution is required. This PR fixes the error by introducing spec constants for the source, weights and destination memory descriptors. The fix is meant to be temporary: the best solution would be to leverage SYCL RTC to handle the argument size error, when available.

Additionally, because the kernels are compiled for specific spec constant values, the source, weights and destination md cannot change afterwards. This means that the SYCL implementation currently cannot support runtime dimensions.

Checklist

General

[ x ] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
[ x ] Have you formatted the code using clang-format?

dzarukin · 2024-12-11T17:35:42Z

tests/benchdnn/dnnl_common.hpp

+#ifdef DNNL_SYCL_GENERIC
+                    // skip unimplemented configs for sycl impl
+                    || is_gpu()
+#endif


IIRC, the expectation was generic kernels cover oneDNN features completely, thus, all unimplemented cases should be properly reported by benchdnn and addressed by developers. Has this been changed?

@mgouicem, @vpirogov

This is indeed the long term goal. As the generic backend is still a work in progress and new features are being added to oneDNN it is useful to be able to skip some configurations that are not supported yet.

That is correct, this has been removed.

Removed it a bit prematurely; it's added back in until a conclusion is reached.

I don't see any unimplemented hits in current validation. It makes me think if there are any new unimplemented cases, they must be local to these changes. I can provide more guidance if there's a pointer on or a snapshot of what's going on.

Without this change, running oneDNN compiled with the generic vendor (sycl ref matmul used) causes benchdnn to fail with a test unimplemented error when testing with runtime dims. Spec constants require the kernel to be compiled for specific spec constant values (specific sizes for src/dst/weights), so the current approach does not support runtime dims.
We are wondering if this should be reported as skipped by benchdnn, as it is expected.

This sounds to me like a matmul-specific skip.
I propose to introduce is_generic_gpu() method identically to is_nvidia_gpu(), and put a check here (before the if() {} section).

The change in this PR will silence ALL drivers with ANY unimplemented hit which is not the desired behavior in a long term.
The practice for marking unimplemented as skipped is to figure out the minimal scope that all such cases fit into and skip only for it. In your case it sounds like any runtime dimensions being involved. You may refer to this statement when crafting it.
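The scoping rule described above (skip only the smallest set of cases that share the unimplemented feature) could look roughly like the sketch below. This is not benchdnn code: `prb_t`, `kRuntimeDim`, `has_runtime_dims` and `should_skip_unimplemented` are hypothetical names, and the `is_generic_gpu` flag merely stands in for the proposed `is_generic_gpu()` method.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a "dimension known only at execution" marker.
constexpr std::int64_t kRuntimeDim = std::numeric_limits<std::int64_t>::min();

// Hypothetical, simplified description of one matmul test case.
struct prb_t {
    std::array<std::int64_t, 3> src_dims; // {batch, M, K}
    std::array<std::int64_t, 3> wei_dims; // {batch, K, N}
    bool is_generic_gpu; // stand-in for the proposed is_generic_gpu() check
};

// True if any dimension of the problem is a runtime dimension.
constexpr bool has_runtime_dims(const prb_t &prb) {
    for (std::size_t i = 0; i < prb.src_dims.size(); ++i)
        if (prb.src_dims[i] == kRuntimeDim) return true;
    for (std::size_t i = 0; i < prb.wei_dims.size(); ++i)
        if (prb.wei_dims[i] == kRuntimeDim) return true;
    return false;
}

// Skip only the narrow case: generic GPU *and* runtime dimensions.
// Any other unimplemented hit would still be reported as a failure.
constexpr bool should_skip_unimplemented(const prb_t &prb) {
    return prb.is_generic_gpu && has_runtime_dims(prb);
}
```

With a predicate of this shape, an unimplemented hit on a fully specified problem is still surfaced, which is the behavior the blanket `is_gpu()` skip would lose.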

dzarukin · 2024-12-11T17:37:19Z

src/xpu/sycl/types.hpp

@@ -146,6 +177,23 @@ struct md_t {
 #undef CHECK_AND_ASSIGN
    }

+    // This constructor is to be called inside the kernel


How do you ensure this constructor will be called only from inside the kernel and not from other places?

I have removed this comment, as it seems to be more misleading than helpful. The idea of the new constructor is to allow the spec constant struct to be convertible to the SYCL version of the memory descriptor, md_t, not to discourage use outside of the kernel, which would be fine.

dzarukin · 2024-12-11T17:38:45Z

src/xpu/sycl/types.hpp

@@ -146,6 +177,23 @@ struct md_t {
 #undef CHECK_AND_ASSIGN
    }

+    // This constructor is to be called inside the kernel
+    md_t(md_t_spec_const &mimicker) {


Suggested change

-    md_t(md_t_spec_const &mimicker) {
+    md_t(const md_t_spec_const &mimicker) : data_type_(...), ... {
+        for () {
+            ...
+        }
+    }

Made the suggested change

dzarukin · 2024-12-11T17:40:38Z

src/xpu/sycl/types.hpp

+    dim32_t inner_nblks_;
+    dims32_t inner_blks_;
+    dims32_t inner_idxs_;
+};


// This struct cannot have a non trivial constructor, or any non trivial types, because...
// why exactly?
static_assert(std::is_trivial_v<md_t_spec_const>, "md_t_spec_const must be trivial");

From the sycl specification HERE we need the spec constant to be device_copyable HERE.

I can clarify this in the comment if you think that makes sense, or remove it altogether?

Looking into this a bit more DPC++ defines SYCL_DEVICE_COPYABLE=1 so we may be able to avoid the need of md_t_spec_const if we specialize is_device_copyable_v<md_t> to return true.

Any knowledge that is not available from reading the code is better put in a comment, either as a reference with the exact spot to look at or as a summary.
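To make the constraint discussed in this thread concrete, here is a minimal sketch of a plain-data mirror struct guarded by a compile-time check, together with a converting constructor on the richer type. All names (`md_spec_const_sketch`, `md_sketch`, `max_dims` and its value) are hypothetical and only approximate the shape of `md_t_spec_const` and `md_t`; the check uses trivial copyability as one way to document the device-copyable requirement mentioned above.

```cpp
#include <cstdint>
#include <type_traits>

constexpr int max_dims = 6; // assumed upper bound, for illustration only

// Hypothetical plain-data mirror of a memory descriptor: only scalars and
// fixed-size arrays, and no user-provided constructors.
struct md_spec_const_sketch {
    std::int32_t ndims;
    std::int32_t dims[max_dims];
    std::int32_t strides[max_dims];
};

// The value stored in a specialization constant has to be device copyable.
// Requiring trivial copyability is one way to state and enforce that here.
static_assert(std::is_trivially_copyable<md_spec_const_sketch>::value,
        "spec-constant payload must be trivially copyable");

// Hypothetical richer descriptor that can be rebuilt from the mirror.
struct md_sketch {
    constexpr md_sketch(const md_spec_const_sketch &other)
        : ndims_(other.ndims), dims_ {}, strides_ {} {
        for (int i = 0; i < max_dims; ++i) {
            dims_[i] = other.dims[i];
            strides_[i] = other.strides[i];
        }
    }

    // Number of elements described by the first ndims_ dimensions.
    constexpr std::int64_t nelems() const {
        std::int64_t n = 1;
        for (int i = 0; i < ndims_; ++i)
            n *= dims_[i];
        return n;
    }

    constexpr std::int32_t stride(int i) const { return strides_[i]; }

private:
    std::int32_t ndims_;
    std::int32_t dims_[max_dims];
    std::int32_t strides_[max_dims];
};
```

Keeping the reason for the check next to the `static_assert` is in line with the request above that knowledge not visible in the code be captured in a comment.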

mgouicem · 2024-12-13T14:39:13Z

src/gpu/generic/sycl/ref_matmul.cpp

@@ -52,16 +55,56 @@ void ref_matmul_t::pd_t::init_conf() {
    for (const auto &mdw : {src_d, weights_d, dst_d, bias_d}) {
        if (mdw.has_runtime_dims()) {
            any_runtime_params_ = true;
-            return;
+            return status::unimplemented;


(minor) please don't hesitate to use verbose macros for these kinds of checks.

Great suggestion, made the change and prefer it this way as well, thank you

mgouicem · 2024-12-13T14:42:20Z

src/gpu/generic/sycl/ref_matmul.cpp

+        memory_desc_wrapper mdw(md);
+
+        assert(mdw.format_kind() == format_kind::blocked);
+        assert(mdw.ndims() <= max_dims);


Given that this function can fail, I would advocate returning a status instead of using asserts, and propagating it in init_conf.

Asserts changed to verbose checks, propagating status_t now as well.
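The pattern agreed on here (report failure through a returned status and propagate it, rather than asserting) can be sketched as below. This is an illustration only: `status_t`, `SKETCH_CHECK`, `init_rt_conf_sketch`, `init_conf_sketch` and the `max_dims` value are hypothetical stand-ins and do not reproduce oneDNN's actual `CHECK` or verbose macros.

```cpp
// Hypothetical status codes, mirroring the success/unimplemented split.
enum class status_t { success, unimplemented };

// Sketch of a propagation macro: return early on any non-success status.
#define SKETCH_CHECK(expr) \
    do { \
        const status_t s_ = (expr); \
        if (s_ != status_t::success) return s_; \
    } while (0)

constexpr int max_dims = 6; // assumed limit, for illustration only

// Validation that reports failure through the return value instead of
// assert(), so release builds also reject unsupported inputs.
constexpr status_t init_rt_conf_sketch(bool is_blocked, int ndims) {
    if (!is_blocked) return status_t::unimplemented;
    if (ndims > max_dims) return status_t::unimplemented;
    return status_t::success;
}

// The caller forwards any failure to its own caller.
constexpr status_t init_conf_sketch(bool is_blocked, int ndims) {
    SKETCH_CHECK(init_rt_conf_sketch(is_blocked, ndims));
    // ... remaining configuration would go here ...
    return status_t::success;
}
```

Unlike `assert`, which is compiled out when `NDEBUG` is defined, the returned status lets primitive creation fall through to another implementation or report the case as unimplemented.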

mgouicem · 2024-12-13T14:48:38Z

src/gpu/generic/sycl/ref_matmul.cpp

    parallel_for(ctx, kernel_, [&](::sycl::handler &cgh) {
-        matmul_kernel_fwd_t matmul_kernel(conf, cgh, ctx);
+        matmul_kernel_fwd_t matmul_kernel(pd()->conf_, cgh, ctx);


As we discussed, using specialization constants will trigger recompilation during execute function.
TBH, I am a bit bothered by the precedent this would set, as multiple users rely on the property that after primitive creation, there should be no recurrent jitting overheads. Are we sure there is no other avenue to avoid the argument size limitation?

I don't follow, we discussed that the spec constant would only trigger recompilation if we supported runtime dimensions. We discussed it does not make sense to support runtime dimensions so in the current patch the kernel compilation should only happen in the init function.
From what I gathered there are no great solutions in the short term:

Reducing the amount of arguments given to the kernel. We already use a struct as an argument to only give the arguments that may be used in the SYCL kernels.

Reducing the number of post-op supported: we already support "only" 5 post-ops. I wouldn't mind reducing this further but this may disable more tests. My understanding was that we didn't want to reduce this further.

The long term solution is to use SYCL-RTC but this will be a large task. The spec constants are useful to be able to run more tests with the SYCL backend until SYCL-RTC is implemented.

I don't follow, we discussed that the spec constant would only trigger recompilation if we supported runtime dimensions. We discussed it does not make sense to support runtime dimensions

I guess I missed that part. Given that this is a reference implementation, I would expect that at some point it would support runtime dimensions. Solving the current argument size limitation issue with specialization_constant would defer runtime dimension support until SYCL-RTC is enabled. Is that something we are fine with?

Reducing the amount of arguments given to the kernel. We already use a struct as an argument to only give the arguments that may be used in the SYCL kernels.

IIUC, sycl_matmul_conf_t is used to pass argument shapes to the kernel right?
If so, it seems to use sycl_post_ops_t which is a union type, for which the biggest members are prelu_post_op with ~100 Bytes and binary_post_op with ~160 Bytes. So 5 post_ops gets us to ~800B, do I get that right?
(which is about 1/4 of argument size limit).

In any case, here are a few suggestions:

for prelu_post_op, using dim_t[xpu::sycl::md_t::max_dims] instead of dims_t for the strides array would halve that size.

for binary_post_op, we could use a couple of arrays for dims and strides instead of using md_t?

In sycl::md_t, we have about 50B related to blocking (inner_block, here). Do we actually use blocked layouts in generic sycl kernels? If not, that would shave about 250B from post_ops, and ~500B from matmul_conf.

separate post-ops that require extra md from the others, so that we can reduce the number of post-ops that require passing md to kernel (e.g. binary/prelu), while not restricting others (e.g. eltwise).

Reducing the number of post-op supported: we already support "only" 5 post-ops. I wouldn't mind reducing this further but this may disable more tests. My understanding was that we didn't want to reduce this further.

Correct, 5 is already low. However, it seems we are using a union for all post-ops. Splitting eltwise from binary/prelu post-ops could help limit only the number of binary/prelu post-ops (e.g. to 2 binary/prelu post-ops), while not restricting the number of eltwise post-ops. That could be an option as well.

Thanks for the detailed suggestion. From previous investigations it did not seem possible to reduce the argument size enough, but we could investigate again. I'm not confident we can remove enough members from the binary post-op.
Just one clarification: the argument size limit is 2kB, so the 5 post-ops would represent about 40% of the total size. We need to shave off a total of 282B, or at least 57B per post-op. I haven't looked into this myself, but this seems difficult just for the binary post-op, which is apparently the biggest issue here.
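The numbers quoted in this exchange can be checked with a few lines of arithmetic. The sketch below assumes that "2kB" means 2048 bytes and takes the ~160B binary post-op as the size of each union slot; the constant names are made up for the illustration.

```cpp
#include <cstddef>

// Figures quoted in the discussion; "2kB" is assumed to mean 2048 bytes.
constexpr std::size_t arg_limit_bytes = 2048;
constexpr std::size_t num_post_ops = 5;
constexpr std::size_t post_op_bytes = 160; // approx. largest union member
constexpr std::size_t overshoot_bytes = 282; // total reduction needed

// Integer division rounding up.
constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// 5 * 160 = 800 bytes spent on post-ops.
constexpr std::size_t post_ops_total = num_post_ops * post_op_bytes;

// 800 * 100 / 2048 = 39 (integer percent), i.e. "about 40%".
constexpr std::size_t post_ops_percent
        = post_ops_total * 100 / arg_limit_bytes;

// ceil(282 / 5) = 57 bytes to remove from each post-op slot.
constexpr std::size_t per_post_op_saving
        = ceil_div(overshoot_bytes, num_post_ops);
```

Under these assumptions each post-op slot would have to shrink from roughly 160B to roughly 103B to fit, which gives a sense of why trimming a few members was not expected to be enough.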

Solving the current argument size limitation issue with specialization_constant would defer runtime dimension support until SYCL-RTC is enabled. Is that something we are fine with?

This sounds fine to me. From what I understand runtime dimensions are not that important if they are just meant for optimization purposes but the SYCL backend is not optimized yet. At this point it is easier to use spec-constant. This is just a workaround for us to be able to run tests and have a clearer picture of what operations are working as expected.

Hi @mgouicem, thank you for your detailed suggestions. The motivation for going ahead with spec-constants was to keep the changes as minimal as possible: they would be fairly simple to revert, and they do not involve considerable refactoring for one specific kernel (which would lead to code duplication) or changes to the "core" of oneDNN, i.e. the de-facto way of doing things in oneDNN. We want to avoid such changes because we expect all of them to be reverted once SYCL-RTC is in place.

In sycl::md_t, we have about 50B related to blocking (inner_block, here). Do we actually use blocked layouts in generic sycl kernels? If not, that would shave about 250B from post_ops, and ~500B from matmul_conf.

I believe that the i8 matmuls, which typically go to cublasLT (which expects data in a blocked format), fall back to the generic kernel for the binary_post_op. Hence the support of blocking over there.

separate post-ops that require extra md from the others, so that we can reduce the number of post-ops that require passing md to kernel (e.g. binary/prelu), while not restricting others (e.g. eltwise).

Since the post_ops_t struct is used by other reference primitives as well, this would mandate a change in them as well, and if something similar were to be developed just for the matmul kernel, it would lead to a lot of code duplication, hence it may not be viable.

for binary_post_op, we could use a couple of arrays for dims and strides instead of using md_t?

The kernel uses the off_v and the off_v_masked methods to calculate certain offsets, and those functions take care of padding if any etc etc. So that logic would need to be copied into another function if we were to isolate these two arrays into a struct of their own

If specialization constants are not the preferred solution to this problem, we were also thinking of moving the kernel arguments to global memory. This would not trigger a re-compilation of the kernel when runtime dimensions are used, and would have minimal effect on the kernel.

I believe that the i8 matmuls, which typically go to cublasLT (which expects data in a blocked format), fall back to the generic kernel for the binary_post_op. Hence the support of blocking over there.

cublasLT blocking does not depend on the blocking_desc structure (which contains inner_blk), but on a dedicated cublasLT_blocking_desc, which is essentially an enum and does not require the inner_blk array.

Since the post_ops_t struct is used by other reference primitives as well, this would mandate a change in them as well.

Not sure I see the required changes to primitives themselves. Here is my thinking: the post_ops_t structure is initialized using the primitive_attr directly (see here), so I don't expect a change of internal API here, but just a change in implementation of sycl_post_ops_t (e.g. instead of containing an array of union here, it would contain one array for lightweight eltwise_po, one small array for bigger binary_po/prelu, and maybe an extra array for indices to interleave these two).
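The layout change proposed above (one array for lightweight eltwise entries, a smaller array for the heavier binary/prelu entries, and an index table to preserve ordering) could be sketched as follows. Every type, field and size here is hypothetical and chosen only to mirror the rough figures from the thread; this is not the actual `sycl_post_ops_t`.

```cpp
#include <cstdint>

// Hypothetical entries: a small eltwise post-op and a large binary/prelu
// post-op whose payload stands in for an embedded memory descriptor.
struct eltwise_po_sketch {
    std::int32_t alg;
    float alpha, beta, scale;
};
struct binary_po_sketch {
    std::int32_t alg;
    unsigned char md_payload[156];
};

constexpr int max_post_ops = 5; // limit mentioned in the thread
constexpr int max_heavy_post_ops = 2; // example cap suggested in the thread

// Union-style layout: every slot is as large as the biggest member.
union any_po_sketch {
    eltwise_po_sketch e;
    binary_po_sketch b;
};
struct union_post_ops_sketch {
    any_po_sketch ops[max_post_ops];
    std::int32_t n;
};

// Split layout: light and heavy entries live in separate arrays, and two
// small tables record, for each position in the chain, which array the
// entry lives in and at which index.
struct split_post_ops_sketch {
    eltwise_po_sketch light[max_post_ops];
    binary_po_sketch heavy[max_heavy_post_ops];
    std::int8_t kind[max_post_ops]; // 0 = light, 1 = heavy
    std::int8_t slot[max_post_ops]; // index into light[] or heavy[]
    std::int32_t n;
};

static_assert(sizeof(split_post_ops_sketch) < sizeof(union_post_ops_sketch),
        "splitting is expected to shrink the kernel argument");
```

The trade-off is the one raised in the thread: the cap applies only to the heavy entries, at the cost of an extra indirection when walking the post-op chain in its original order.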

mgouicem · 2024-12-13T14:52:23Z

make test
disable device_cpu
enable device_gpu
enable thr_cuda
enable thr_generic
enable arch_rtx

dzarukin · 2024-12-13T19:39:10Z

src/gpu/generic/sycl/matmul_kernels.hpp

@@ -409,16 +410,23 @@ struct matmul_kernel_fwd_t {
                  CTX_IN_SYCL_KERNEL_MEMORY(DNNL_ARG_ATTR_DROPOUT_PROBABILITY))
        , po_args_(cgh, ctx, conf_.post_ops) {}

-    void operator()(::sycl::nd_item<1> item) const {
+    void operator()(::sycl::nd_item<1> item, ::sycl::kernel_handler kh) const {


Suggested change

-    void operator()(::sycl::nd_item<1> item, ::sycl::kernel_handler kh) const {
+    void operator()(::sycl::nd_item<1> item, const ::sycl::kernel_handler &kh) const {

?

The kernel_handler cannot be const because get_specialization_constant() is not marked const, so the call would produce an error.

dzarukin · 2024-12-13T19:39:42Z

src/gpu/generic/sycl/matmul_kernels.hpp

+        auto md_t_spec_const_pod_val = kh.get_specialization_constant<
+                detail::matmul::md_t_spec_const_id>();
+        auto data_md = md_t_spec_const_pod_val.data_md_t;
+        auto weights_md = md_t_spec_const_pod_val.weights_md_t;
+        auto dst_md = md_t_spec_const_pod_val.dst_md_t;


const auto & to remove unnecessary copies?

Unneeded copies removed.

dzarukin · 2024-12-13T19:40:15Z

src/gpu/generic/sycl/matmul_kernels.hpp

+        auto weights_md = md_t_spec_const_pod_val.weights_md_t;
+        auto dst_md = md_t_spec_const_pod_val.dst_md_t;
+
+        memory_tensor_t data_mem(data_, data_md);


Nit: for matmul it's called src, not data.

data renamed to src

dzarukin · 2024-12-13T19:40:55Z

src/gpu/generic/sycl/ref_matmul.cpp

+    init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d,
+            weights_d, dst_d, bias_d);


Suggested change

-    init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d,
-            weights_d, dst_d, bias_d);
+    CHECK(init_rt_conf(conf_, data_md_t, dst_md_t, weights_md_t, src_d,
+            weights_d, dst_d, bias_d));

The check was not needed before because the function used asserts; it now returns a status, and the CHECK has been added.

dzarukin · 2024-12-13T19:42:28Z

src/gpu/generic/sycl/ref_matmul.hpp

+            auto init_status = init_conf();
+
+            return init_status;


Suggested change

-            auto init_status = init_conf();
-
-            return init_status;
+            return init_conf();

Suggested change made.

dzarukin · 2024-12-13T19:43:54Z

src/xpu/sycl/types.hpp

@@ -146,6 +177,21 @@ struct md_t {
 #undef CHECK_AND_ASSIGN
    }

+    md_t(const md_t_spec_const &mimicker)


Nit:

Suggested change

-    md_t(const md_t_spec_const &mimicker)
+    md_t(const md_t_spec_const &other)

other sounds better, change made.

Address comments Aaand its back! Addressed comments

ShanoToni requested review from a team as code owners December 11, 2024 17:11

github-actions bot added the platform:gpu-generic Codeowner: @oneapi-src/onednn-gpu-generic label Dec 11, 2024

dzarukin reviewed Dec 11, 2024

ShanoToni force-pushed the atharva/sycl_matmul_conf_size branch 2 times, most recently from 55e83bd to ac9c4aa Compare December 12, 2024 15:55

mgouicem reviewed Dec 13, 2024

dzarukin reviewed Dec 13, 2024

generic: sycl: Introduce spec constants for sycl matmul

fcaa982

Address comments Aaand its back! Addressed comments

ShanoToni force-pushed the atharva/sycl_matmul_conf_size branch from ac9c4aa to fcaa982 Compare December 18, 2024 11:00

github-actions bot added the component:tests Codeowner: @oneapi-src/onednn-arch label Dec 18, 2024
