Some questions about setting the descriptor for wgmma. #2223

linuxlonelyeagle · 2025-04-05T13:31:40Z

I'm trying to write a single GEMM using wgmma PTX, but I've run into some problems, and this one is about the descriptor. I've been stuck here for a long time.

I'm going to describe my progress on this issue in more detail below.

At first my program ran successfully, but it didn't produce the correct results. The results looked strange. The rest of the program should be fine, so I concluded that the wgmma descriptor was not being set correctly.

I started looking into how CuTe does it and found make_gmma_desc. It was too much of a pain to use directly, so I ported it to a version that runs on the CPU. Then I initialized my A and B tensors.

  using TA = cute::half_t;
  using TB = cute::half_t;
  using TC = cute::half_t;
  constexpr int m = 64, n = 16, k = 16;

  thrust::host_vector<TA> h_A(m * k);
  thrust::host_vector<TB> h_B(n * k);
  thrust::host_vector<TC> h_C(m * n);

  // Initialize the tensors
  for (int j = 0; j < m * k; ++j)
    h_A[j] = TA(j);
  for (int j = 0; j < n * k; ++j)
    h_B[j] = TB(j);
 
  auto dA = make_stride(make_stride(Int<8>{}, Int<64>{}), make_stride(Int<1>{}, Int<512>{})); 
  auto dB = make_stride(make_stride(Int<8>{}, Int<64>{}), make_stride(Int<1>{}, Int<128>{})); 

  auto tensor_a = make_tensor(h_A.data(), make_shape(make_shape(Int<8>{}, Int<8>{}), make_shape(Int<8>{}, Int<2>{})),

The layout here follows https://docs.nvidia.com/cuda/parallel-thread-execution/#async-warpgroup-k-no-swizzle-tf32.

Then I used my make_gmma_desc (the code is the same as upstream, changed only to output the leading dimension byte offset and stride dimension byte offset) to get both offsets.

For the A matrix, lbo: 64, sbo: 8; for the B matrix, lbo: 16, sbo: 8.

Then I applied the leading dimension byte offset and stride dimension byte offset to my program. It still doesn't run correctly.
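A minimal sketch of where these numbers can come from, assuming 2-byte half elements and that the descriptor stores each byte offset divided by 16 (the PTX documentation describes the field encoding as `(x & 0x3FFFF) >> 4`). `encode_offsets` is a hypothetical helper, not a CuTe function. The element strides 512, 128 and 64 are the ones in dA and dB above.

```cpp
#include <cassert>
#include <cstdint>

// Encoded descriptor offsets, in units of 16 bytes.
struct Offsets {
    uint32_t lbo;  // leading dimension byte offset >> 4
    uint32_t sbo;  // stride dimension byte offset >> 4
};

// Hypothetical helper (not part of CuTe). Assumes a K-major, no-swizzle tile:
//   k_block_stride_elems: element stride between core matrices along K
//   m_block_stride_elems: element stride between core matrices along M (or N)
constexpr Offsets encode_offsets(uint32_t k_block_stride_elems,
                                 uint32_t m_block_stride_elems,
                                 uint32_t elem_bytes) {
    return Offsets{(k_block_stride_elems * elem_bytes) >> 4,
                   (m_block_stride_elems * elem_bytes) >> 4};
}

// A: strides ((8, 64), (1, 512)), 2-byte elements.
static_assert(encode_offsets(512, 64, 2).lbo == 64, "A lbo");
static_assert(encode_offsets(512, 64, 2).sbo == 8, "A sbo");
// B: strides ((8, 64), (1, 128)), 2-byte elements.
static_assert(encode_offsets(128, 64, 2).lbo == 16, "B lbo");
static_assert(encode_offsets(128, 64, 2).sbo == 8, "B sbo");
```

Under these assumptions the values 64/8 and 16/8 are not byte counts: they stand for 1024/128 bytes for A and 256/128 bytes for B.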

I started trying to adjust the contents of the A and B matrices.

Take the A matrix as an example.

   0    0    0    0    0    0    0    0   32   32   32   32   32   32   32   32 
   0    0    0    0    0    0    0    0   32   32   32   32   32   32   32   32 
   1    1    1    1    1    1    1    1   33   33   33   33   33   33   33   33 
   1    1    1    1    1    1    1    1   33   33   33   33   33   33   33   33 
   2    2    2    2    2    2    2    2   34   34   34   34   34   34   34   34 
   2    2    2    2    2    2    2    2   34   34   34   34   34   34   34   34 
   3    3    3    3    3    3    3    3   35   35   35   35   35   35   35   35 
   3    3    3    3    3    3    3    3   35   35   35   35   35   35   35   35 
   4    4    4    4    4    4    4    4   36   36   36   36   36   36   36   36 
   4    4    4    4    4    4    4    4   36   36   36   36   36   36   36   36 
   ...
   31  ....

or

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 
...
63 ....

Can anyone see where I went wrong? I hope someone can help me, thank you very much. This is important to me, and understanding this issue will help with my involvement in open source compilers such as LLVM/MLIR. Thanks all.

linuxlonelyeagle · 2025-04-05T13:34:27Z

Author

@hwu36 @thakkarV I'm sorry to bother you, but I thought you might know what to do. I've been investigating this issue for over ten days now.

4 replies

thakkarV Apr 5, 2025
Collaborator

It's hard to help if you have written all your code from scratch. I would just print things from the CUTLASS implementation and try to match your implementation against it to find bugs.

linuxlonelyeagle Apr 5, 2025
Author

I know it's hard to debug exactly what's wrong, but I still think I have a problem with my Leading dimension byte offset and Stride dimension byte offset.

linuxlonelyeagle Apr 5, 2025
Author

What I actually had in mind as my input was:
The A matrix is half_t a_smem[64][16] (m, k)

// A matrix content
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 
...
63 ....

B matrix is half_t b_smem[16][16] (n, k)

// B matrix content
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 
...
16 16 ....

For the A matrix, lbo: 64, sbo: 8; for the B matrix, lbo: 16, sbo: 8. Is that correct?

linuxlonelyeagle Apr 5, 2025
Author

I think you're right, but I'm not terribly familiar with cute.

linuxlonelyeagle · 2025-04-05T13:42:27Z

Author

https://docs.nvidia.com/cuda/parallel-thread-execution/#async-warpgroup-k-no-swizzle-tf32
Other than that, I found errors in the doc.

8 replies

manishucsd Apr 5, 2025

Here is some code that shows how and where to print layouts, to help build a deeper understanding of this topic.

linuxlonelyeagle Apr 5, 2025
Author

Thank you very much, I think I need to study what's in the picture.

linuxlonelyeagle Apr 6, 2025
Author

Honestly, I couldn't see how the LBO and SBO for the A and B matrices were calculated in your approach.

linuxlonelyeagle Apr 6, 2025
Author

// my code
#include <cstdio>
#include <cute/tensor.hpp>
#include <cute/atom/mma_traits_sm90_gmma.hpp>

using namespace cute;
using namespace cute::SM90::GMMA;

using TA = cute::half_t;
using TB = cute::half_t;
using TC = cute::half_t;
using TI = cute::half_t;

constexpr int m = 64, n = 16, k = 16;

__global__ void single_gemm(TC* c) {
  __shared__ TA share_a[m][k];
  __shared__ TB share_b[n][k];
  // A core matrix is 8 rows of 128 bits, i.e. 8 half_t elements per row.
  constexpr int core_matrix_col = 128 / 16;
  constexpr int core_matrix_row = 8;

  // K-major, no-swizzle layouts expressed in terms of core matrices.
  auto shape_a = make_shape(make_shape(Int<core_matrix_row>{}, Int<m / core_matrix_row>{}),
                            make_shape(Int<core_matrix_col>{}, Int<k / core_matrix_col>{}));
  auto shape_b = make_shape(make_shape(Int<core_matrix_row>{}, Int<n / core_matrix_row>{}),
                            make_shape(Int<core_matrix_col>{}, Int<k / core_matrix_col>{}));
  auto stride_a = make_stride(make_stride(Int<8>{}, Int<8 * core_matrix_col>{}),
                              make_stride(Int<1>{}, Int<m * core_matrix_col>{}));
  auto stride_b = make_stride(make_stride(Int<8>{}, Int<8 * core_matrix_col>{}),
                              make_stride(Int<1>{}, Int<n * core_matrix_col>{}));
  auto tensor_a = make_tensor(make_smem_ptr((TA*)share_a), shape_a, stride_a);
  auto tensor_b = make_tensor(make_smem_ptr((TB*)share_b), shape_b, stride_b);
  auto desc_a = make_gmma_desc<Major::K>(tensor_a);
  auto desc_b = make_gmma_desc<Major::K>(tensor_b);
  // Print the descriptors from a single thread only.
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    print(desc_a);
    printf("\n");
    print(desc_b);
  }
}
// Output:

GmmaDescriptor: 0x0000000800400040
  start_addr :  0x0040
  leading_off:  0x0040 (64)
  stride_off :  0x0008 (8)
  base_offset:  0x0
  layout_type:  0x0 (INTERLEAVE)

GmmaDescriptor: 0x00000008001000c0
  start_addr :  0x00c0
  leading_off:  0x0010 (16)
  stride_off :  0x0008 (8)
  base_offset:  0x0
  layout_type:  0x0 (INTERLEAVE)
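
For readers who want to check the printed 64-bit values by hand, here is a minimal sketch that splits a descriptor into the fields shown above. It assumes the bit positions given in the PTX documentation for the wgmma matrix descriptor (start address in bits 0-13, leading offset in bits 16-29, stride offset in bits 32-45, base offset in bits 49-51, layout type in bits 62-63). `GmmaFields`, `bits` and `decode` are hypothetical names for illustration, not CuTe APIs.

```cpp
#include <cassert>
#include <cstdint>

// Decoded descriptor fields; offsets and address are still in 16-byte units.
struct GmmaFields {
    uint32_t start_addr;   // bits 0-13
    uint32_t leading_off;  // bits 16-29
    uint32_t stride_off;   // bits 32-45
    uint32_t base_offset;  // bits 49-51
    uint32_t layout_type;  // bits 62-63
};

// Extracts `width` bits of `v` starting at bit `lo`.
constexpr uint32_t bits(uint64_t v, unsigned lo, unsigned width) {
    return static_cast<uint32_t>((v >> lo) & ((uint64_t{1} << width) - 1));
}

// Hypothetical decoder (assumed bit layout, see the lead-in).
constexpr GmmaFields decode(uint64_t d) {
    return GmmaFields{bits(d, 0, 14), bits(d, 16, 14), bits(d, 32, 14),
                      bits(d, 49, 3), bits(d, 62, 2)};
}

// The two descriptors printed in this thread.
static_assert(decode(0x0000000800400040ull).start_addr == 0x40, "A start");
static_assert(decode(0x0000000800400040ull).leading_off == 64, "A lbo");
static_assert(decode(0x0000000800400040ull).stride_off == 8, "A sbo");
static_assert(decode(0x00000008001000c0ull).start_addr == 0xc0, "B start");
static_assert(decode(0x00000008001000c0ull).leading_off == 16, "B lbo");
static_assert(decode(0x00000008001000c0ull).stride_off == 8, "B sbo");
```

If the start address uses the same divide-by-16 encoding, 0x0040 and 0x00c0 would correspond to byte addresses 0x400 and 0xC00, which are 2048 bytes apart. That equals the size of share_a (64 x 16 x 2 bytes).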

I set these parameters in my CUDA program, but it still doesn't produce the correct result.
Actually, I think people are missing an important point.

When tensor_a and tensor_b are constructed, the underlying data is arranged consecutively: 0, 1, 2, 3, 4, 5, 6... But tensor_a and tensor_b carry a layout, which amounts to remapping the coordinates before the SBO and LBO are computed.

My program has no such layout. The real addresses of tensor_a and tensor_b are consecutive and the storage stays 0, 1, 2, 3, 4, 5, 6..., yet it directly uses the result computed by make_gmma_desc in CuTe.

It's hard to believe this makes no difference.
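
If the observation above is right, the data in shared memory would have to be physically rearranged so that it matches the layout the descriptor describes. Below is a minimal host-side sketch of that rearrangement, assuming 2-byte elements (8 per 16-byte core-matrix row) and the strides ((8, 64), (1, rows * 8)) used for tensor_a and tensor_b. `core_matrix_offset` and `repack_k_major` are hypothetical helper names, not CuTe APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element offset of logical (row, col) under the layout
//   ((8, rows/8), (8, cols/8)) : ((8, 64), (1, rows*8))
// Assumes 8 elements per core-matrix row, i.e. 2-byte elements.
constexpr std::size_t core_matrix_offset(std::size_t row, std::size_t col,
                                         std::size_t rows) {
    return (row % 8) * 8 + (row / 8) * 64 + (col % 8) + (col / 8) * (rows * 8);
}

// Copies a row-major rows x cols tile into core-matrix order.
template <class T>
std::vector<T> repack_k_major(const std::vector<T>& src, std::size_t rows,
                              std::size_t cols) {
    assert(src.size() == rows * cols);
    assert(rows % 8 == 0 && cols % 8 == 0);
    std::vector<T> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[core_matrix_offset(r, c, rows)] = src[r * cols + c];
    return dst;
}

// For the 64 x 16 A tile: row 1 starts at offset 8 (row-major would be 16),
// and column 8 of row 0 jumps to the second K block at offset 512.
static_assert(core_matrix_offset(1, 0, 64) == 8, "next row in core matrix");
static_assert(core_matrix_offset(8, 0, 64) == 64, "next core matrix along M");
static_assert(core_matrix_offset(0, 8, 64) == 512, "next core matrix along K");
```

With row-major storage, element (1, 0) sits at offset 16, while this layout expects it at offset 8, so a descriptor computed from the layout would read different elements than the ones the program wrote.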

linuxlonelyeagle Apr 6, 2025
Author

Next are some questions I'd like to ask.

Why can't I just pass in a tensor like this one and calculate the SBO and LBO?
tensor<64x16xf16, stride:(16, 1)>. Instead it must be set to tensor<((8,8),(8,2)), stride:((8,64),(1,512))>. I know this structure is meant to match what is in PTX, and that CuTe reports an error if it's not set up this way.
Shouldn't a 64x64 box of shared memory be usable to calculate LBO and SBO?
@manishucsd If you just drew a picture to give an example, forget I said anything. But honestly, a 64x64xf16 region of shared memory could just as well be 256x16xf16, in which case its LBO and SBO would be 256 and 16 respectively.
