shared memory + atomic addition? #5894

Closed
Jaisiero opened this issue Dec 17, 2024 · 19 comments

Labels: Driver Bug, goal:client support (Feature or fix needed for a current slang user.)

@Jaisiero

Jaisiero commented Dec 17, 2024

Hi there! I'm using shared memory and trying to increment it with an atomic operation in a compute shader, but it's not working.
I have simplified the code as much as possible. Could you guys take a look?

// Edit: I forgot to include these constants at first
static const uint RADIX_SORT_WORKGROUP_SIZE = 256; // assert WORKGROUP_SIZE >= RADIX_SORT_BINS
static const uint RADIX_SORT_BINS = 256;

// groupshared variables
groupshared uint _histogram[RADIX_SORT_BINS];

// Entry point with HLSL semantics
[numthreads(RADIX_SORT_WORKGROUP_SIZE, 1, 1)]
void entry_radix_sort_histogram(uint3 pixel_i : SV_DispatchThreadID,
                                uint3 GroupThreadID : SV_GroupThreadID,
                                uint3 GroupID : SV_GroupID)
{
  uint lID = GroupThreadID.x;
  uint wID = GroupID.x;

  Ptr<uint> global_histograms =
      Ptr<uint>(RBRSH.task_head.global_histograms);

  Ptr<SimConfig> sim_config =
      Ptr<SimConfig>(RBRSH.task_head.sim_config);

  uint shift = sim_config->radix_shift;

  uint num_of_elements = sim_config->rigid_body_count;

  // Initialize histogram
  if (lID < RADIX_SORT_BINS)
  {
    _histogram[lID] = 0U;
  }
  GroupMemoryBarrierWithGroupSync();

  // Add to histogram
  if (lID < RADIX_SORT_BINS)
  {
    uint old_value = 0;
    InterlockedAdd(_histogram[lID], 1, old_value);
    // ++_histogram[lID]; <-- this works to increase count but I obviously need it in an atomic manner
  }
  GroupMemoryBarrierWithGroupSync();

  if (lID < RADIX_SORT_BINS)
  {
    uint count = _histogram[lID];
    global_histograms[RADIX_SORT_BINS * wID + lID] = count;
    if(count > 0)
      printf("Histogram global bin %d, bin %d, count %d\n", RADIX_SORT_BINS * wID + lID, lID, count);
  }
}

Expected result: it should print at least once (certainly more than zero; the printf does fire when a regular increment is used instead).
Actual result: when I use atomics, the value read back is always zero (no printf is performed at all).

Thank you!

Jaime.

@csyonghe
Collaborator

I wonder if it is a driver issue.

I have this simple code:

groupshared uint m[4];

[numthreads(4,1,1)]
void computeMain(int i : SV_GroupThreadID)
{
    m[i] = 0;
    GroupMemoryBarrierWithGroupSync();
    InterlockedAdd(m[i], 1);
    GroupMemoryBarrierWithGroupSync();
    if (m[i] != 0)
        printf("mem");
}

which generates:

; SPIR-V
; Version: 1.5
; Generator: Khronos; 40
; Bound: 38
; Schema: 0
               OpCapability Shader
               OpExtension "SPV_KHR_non_semantic_info"
         %33 = OpExtInstImport "NonSemantic.DebugPrintf"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %computeMain "main" %m %gl_LocalInvocationID
               OpExecutionMode %computeMain LocalSize 4 1 1
               OpSource Slang 1
         %35 = OpString "mem"
               OpName %m "m"
               OpName %computeMain "computeMain"
               OpDecorate %gl_LocalInvocationID BuiltIn LocalInvocationId
               OpDecorate %_arr_uint_int_4 ArrayStride 4
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
        %int = OpTypeInt 32 1
      %int_4 = OpConstant %int 4
%_arr_uint_int_4 = OpTypeArray %uint %int_4
%_ptr_Workgroup__arr_uint_int_4 = OpTypePointer Workgroup %_arr_uint_int_4
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
     %uint_0 = OpConstant %uint 0
     %uint_2 = OpConstant %uint 2
   %uint_264 = OpConstant %uint 264
     %uint_1 = OpConstant %uint 1
       %bool = OpTypeBool
%gl_LocalInvocationID = OpVariable %_ptr_Input_v3uint Input
          %m = OpVariable %_ptr_Workgroup__arr_uint_int_4 Workgroup
%computeMain = OpFunction %void None %3
          %4 = OpLabel
          %9 = OpLoad %v3uint %gl_LocalInvocationID
         %12 = OpCompositeExtract %uint %9 0
         %14 = OpBitcast %int %12
         %20 = OpAccessChain %_ptr_Workgroup_uint %m %14
               OpStore %20 %uint_0
               OpControlBarrier %uint_2 %uint_2 %uint_264
         %27 = OpAtomicIAdd %uint %20 %uint_1 %uint_0 %uint_1
               OpControlBarrier %uint_2 %uint_2 %uint_264
         %29 = OpLoad %uint %20
         %31 = OpINotEqual %bool %29 %uint_0
               OpSelectionMerge %6 None
               OpBranchConditional %31 %5 %6
          %5 = OpLabel
         %34 = OpExtInst %void %33 1 %35
               OpBranch %6
          %6 = OpLabel
               OpReturn
               OpFunctionEnd
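
For reference, here is the same kernel annotated with the instructions each key line lowers to; the scope and memory-semantics values are decoded per the SPIR-V spec:

groupshared uint m[4];

[numthreads(4,1,1)]
void computeMain(int i : SV_GroupThreadID)
{
    m[i] = 0;                           // OpStore to the Workgroup pointer %20
    GroupMemoryBarrierWithGroupSync();  // OpControlBarrier: execution scope Workgroup (2),
                                        //   memory scope Workgroup (2),
                                        //   semantics 264 = AcquireRelease | WorkgroupMemory
    InterlockedAdd(m[i], 1);            // OpAtomicIAdd on the same Workgroup pointer:
                                        //   scope Device (1), semantics Relaxed (0), operand 1
    GroupMemoryBarrierWithGroupSync();  // same barrier as above
    if (m[i] != 0)                      // OpLoad from the Workgroup pointer, compared against 0
        printf("mem");
}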

Looking at this SPIR-V, I am not able to identify anything suspicious. Can you update your driver and try again? Which platform/hardware are you using?

@csyonghe
Collaborator

I recommend using Slang's Atomic type for more explicit control over atomic loads and stores.

Can you try:

groupshared Atomic<uint> m[4];

[numthreads(4,1,1)]
void computeMain(int i : SV_GroupThreadID)
{
    m[i] = 0;
    GroupMemoryBarrierWithGroupSync();
    m[i] += 1;  // or equivalently: m[i].add(1);
    GroupMemoryBarrierWithGroupSync();
    if (m[i].load() != 0)
        printf("mem");
}

@Jaisiero
Author

Jaisiero commented Dec 18, 2024

@csyonghe Thank you very much for answering, and I am sorry I didn't provide a minimal reproduction of the bug at first. Your code works.

However, I think I have found the combination of factors that triggers it. Here it is:

// groupshared variables
groupshared uint _histogram[4];

// Entry point with HLSL semantics
[numthreads(4, 1, 1)]
void entry_radix_sort_histogram(int i : SV_DispatchThreadID)
{
  uint lID = i;

  // global histogram buffer
  Ptr<uint> global_histograms =
      Ptr<uint>(RBRSH.task_head.global_histograms);

  // Initialize histogram
  _histogram[lID] = 0U;
  GroupMemoryBarrierWithGroupSync();

  // Add to histogram
  InterlockedAdd(_histogram[lID], 1U);
  GroupMemoryBarrierWithGroupSync();

  uint count = _histogram[lID];
  global_histograms[lID] = count;    // <-- this is the problematic line
  if (count > 0)
    printf("Histogram global bin %d, count %d\n", lID, count);
}

It seems like writing into global memory is interfering with my shared memory. The global buffer is large enough to hold hundreds of 32-bit values, so I don't think I am writing out of bounds or anything like that.

By the way, this is the SPIR-V code:

[144]  %11 = OpFunction %3 None %12
Label 145
[Selection Header 145]
[145]  %13 = OpLabel
[146]  %17 = OpLoad %16 %19
[147]  %20 = OpCompositeExtract %6 %17 0
[148]  %22 = OpBitcast %21 %20
[149]  %23 = OpExtInst %3 %2 DebugLine %4 %24 %24 %25 %26
[150]  %27 = OpExtInst %3 %2 DebugLine %4 %24 %24 %25 %26
[151]  %28 = OpExtInst %3 %2 DebugLine %4 %24 %24 %25 %26
[152]  %29 = OpBitcast %6 %22
[153]  %30 = OpExtInst %3 %2 DebugLine %4 %31 %31 %25 %26
[154]  %45 = OpAccessChain %44 %42 %43
[155]  %48 = OpAccessChain %47 %45 %46
[156]  %49 = OpLoad %40 %48
[157]  %50 = OpExtInst %3 %2 DebugLine %4 %51 %51 %25 %26
[158]  %57 = OpAccessChain %56 %55 %29
[159]  OpStore %57 %58
[160]  %60 = OpExtInst %3 %2 DebugLine %4 %61 %61 %25 %26
[161]  OpControlBarrier %62 %62 %63
[162]  %65 = OpExtInst %3 %2 DebugLine %4 %66 %66 %25 %26
[163]  %67 = OpPtrAccessChain %40 %49 %29
[164]  %68 = OpAtomicIAdd %6 %57 %69 %58 %69
[165]  %70 = OpExtInst %3 %2 DebugLine %4 %71 %71 %25 %26
[166]  OpControlBarrier %62 %62 %63
[167]  %73 = OpExtInst %3 %2 DebugLine %4 %74 %74 %25 %26
[168]  %75 = OpLoad %6 %57
[169]  %76 = OpExtInst %3 %2 DebugLine %4 %77 %77 %25 %26
[170]  OpStore %67 %75 Aligned 4
[171]  %79 = OpExtInst %3 %2 DebugLine %4 %80 %80 %25 %26
[172]  %82 = OpUGreaterThan %81 %75 %58
[173]  OpSelectionMerge %15 None
[174]  OpBranchConditional %82 %14 %15
Label 175
[175]  %14 = OpLabel
[176]  %84 = OpExtInst %3 %2 DebugLine %4 %85 %85 %8 %86
[177]  %87 = OpExtInst %3 %88 DebugPrintf %89 %29 %75
[178]  OpBranch %15
Label 179
[Return 179] [Selection Merge 145]
[179]  %15 = OpLabel
[180]  OpReturn
[181]  OpFunctionEnd

[screenshot]

I'll attach the whole spirv output in case you need it:
Radix Sort Histogram.comp.entry_radix_sort_histogram.spv.txt

I hope this sheds some light on the issue.

Edit: I am on Vulkan Instance Version: 1.3.280, Win11, RTX4090, Nvidia driver v.566.36

Jaime.

@Jaisiero
Author

Jaisiero commented Dec 18, 2024

I updated the sample to match your second one. It still doesn't give me the right answer:

// groupshared variables
groupshared Atomic<uint> _histogram[4];

// Entry point with HLSL semantics
[numthreads(4, 1, 1)]
void entry_radix_sort_histogram(int i : SV_DispatchThreadID)
{
  // global histogram buffer
  Ptr<uint> global_histograms =
      Ptr<uint>(RBRSH.task_head.global_histograms);

  // Initialize histogram
  _histogram[i] = 0U;
  GroupMemoryBarrierWithGroupSync();

  // Add to histogram
  _histogram[i] += 1U;
  GroupMemoryBarrierWithGroupSync();

  uint count = _histogram[i].load();
  global_histograms[i] = count;
  if(count > 0)
    printf("Histogram global bin %d, count %d\n", i, count);
}

Edit: If I increment global memory directly in my thread:

{
 // all the code is the same except for this line
 global_histograms[i] += 1U;
}

my memory is updated correctly:
[screenshot]

So I don't understand why those writes to global memory cause bad behaviour in my shared memory.

Thanks for your attention.

Jaime.

@csyonghe
Collaborator

I feel you may be running into a driver bug.
Any chance you can test your shader on a GPU from a different vendor?

@csyonghe
Collaborator

I would also try changing the barrier to AllMemoryBarrierWithGroupSync() and see if it changes anything.
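
For example, applied to the earlier repro, only the barrier calls change (just a sketch):

groupshared Atomic<uint> m[4];

[numthreads(4,1,1)]
void computeMain(int i : SV_GroupThreadID)
{
    m[i] = 0;
    AllMemoryBarrierWithGroupSync();   // waits on all memory accesses (device + groupshared), not just groupshared
    m[i] += 1;
    AllMemoryBarrierWithGroupSync();
    if (m[i].load() != 0)
        printf("mem");
}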

@Jaisiero
Author

Jaisiero commented Dec 18, 2024

I would also try changing the barrier to AllMemoryBarrierWithGroupSync() and see if it changes anything.

I did use AllMemoryBarrierWithGroupSync as suggested, but nothing changed. I will try to find someone who can test the code on other hardware, because I have no other GPU available right now.

I'll keep you updated.

Jaime.

bmillsNV added this to the Q1 2025 (Winter) milestone Dec 19, 2024
bmillsNV added the goal:client support and Driver Bug labels Dec 19, 2024
@Jaisiero
Author

Hi again @csyonghe,
I hope you are doing well.

I put together a simple project for the Daxa API that reproduces this, using both GLSL and Slang:
shared mem sample

The GLSL version works, the Slang one doesn't.
I checked this using the Vulkan debug layer.

Best regards,
Jaime.

csyonghe self-assigned this Jan 16, 2025
@csyonghe
Collaborator

Let me verify this on my machine and get back to you.

@csyonghe
Collaborator

csyonghe commented Jan 16, 2025

I created this test:

//TEST:COMPARE_COMPUTE(filecheck-buffer=CHECK): -shaderobj -vk

groupshared Atomic<uint> values[10];

//TEST_INPUT: set outputAddr = out ubuffer(data=[0 0 0 0], stride=4)
uniform uint64_t outputAddr;

[numthreads(4,1,1)]
void computeMain(int i : SV_DispatchThreadID)
{
    uint* output = (uint*)(outputAddr);
    values[i] = 0;
    GroupMemoryBarrierWithGroupSync();
    values[i] += 1;
    GroupMemoryBarrierWithGroupSync();
    output[i] = values[i].load();

    // CHECK: 1
    // CHECK: 1
    // CHECK: 1
    // CHECK: 1
}

The result of outputBuffer after running the shader is:

1
1
1
1

which suggests that the code is working properly. There could be other factors in your application that are causing this issue. I inspected the SPIR-V and was unable to find anything wrong with it.

I am on RTX4090 with NVIDIA driver 561.09.

@csyonghe
Collaborator

csyonghe commented Jan 16, 2025

I see that you are using unoptimized code. Can you pass -O1 to slang and let spirv-opt process the SPIR-V first?

I've seen bugs within the driver when dealing with unoptimized SPIR-V.

Update: I tried running unoptimized SPIR-V for this test as well, and it still produces the correct result.

@Jaisiero
Author

Jaisiero commented Jan 16, 2025

I installed that driver version:
[screenshot]

The output of my buffer is:
0 0 0 0

That matches the fact that shared memory never gets updated.
So nothing has changed, and I am not sure what the source of this issue is.

Update: I did another test. If I change these lines:

uint count = values[i].load();
output[i] += count;

to these:

uint count = values[i].load();
output[i] += 1;

my buffer gets updated:

8359 8359 8359 8359

@csyonghe
Collaborator

So you ran the same test in PR #6107 and you see 0s?

@Jaisiero
Author

I have run your test and it works on my machine!

[screenshot]

Do you think that bindless descriptors could be messing with shared memory somehow?
#3532

I was looking into the compilation flags that Daxa uses:
std::array<char const *, 4> cmd_args = {
    // https://github.com/shader-slang/slang/issues/3532
    // Disables warning for aliasing bindings.
    // clang-format off
    "-warnings-disable", "39001",
    "-O0",
    "-g2",
    // clang-format on
};

@Ipotrick

Ipotrick commented Jan 17, 2025

I can now also replicate the issue Jaisiero showed.
This is my replication:

groupshared uint value;

struct Push
{
    uint* value;
}
[[vk::push_constant]] Push push;

[shader("compute")]
[numthreads(1, 1, 1)]
void entry_replication()
{
    value = 0;
    InterlockedAdd(value, 1);
    uint ptr_read_value = *push.value; // set to 0 by cpu.
    printf("value: %i\n", value + ptr_read_value);
}

It prints 0, but it should print 1.
The error only appears with pointer usage.
This is very similar to another Slang GitHub issue.
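
For comparison, a version without the pointer read prints 1 as expected (a minimal control sketch; the entry-point name is just illustrative):

groupshared uint value;

[shader("compute")]
[numthreads(1, 1, 1)]
void entry_no_pointer()
{
    value = 0;
    InterlockedAdd(value, 1);
    printf("value: %i\n", value); // prints 1 when no pointer is dereferenced
}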

Config:

Slang Version: 2024.11
pc platform: x86-64msvc windows11
GPU: RTX4080, Driver: 566.03, Type: DCH
Compile options (cpp api):
slang::TargetDesc::format: SlangCompileTarget::SLANG_SPIRV
slang::TargetDesc::profile: glsl_450+spirv_1_4
slang::TargetDesc::flags: SLANG_TARGET_FLAG_GENERATE_SPIRV_DIRECTLY
slang::TargetDesc::forceGLSLScalarBufferLayout = true
slang::SessionDesc::defaultMatrixLayoutMode: SLANG_MATRIX_LAYOUT_COLUMN_MAJOR
slangRequest->processCommandLineArguments({"-warnings-disable", "39001","-O0"});

I also tried barriers and removing all the compile flags, but nothing helped.

I suspect that the Vulkan validation layers could cause a difference in results.

@Ipotrick

Ipotrick commented Jan 17, 2025

Update: using the latest Slang I cannot reproduce the issue. It seems to work now.

@Jaisiero
Author

Confirmed: the newest Slang version solved the issue for me.

@csyonghe
Collaborator

@Jaisiero Glad to know that this issue is resolved! Closing this now, and please let us know if you run into other problems.

@Jaisiero
Author

Thank you for helping! Much appreciated.
