shared memory + atomic addition? #5894

Closed
Jaisiero opened this issue Dec 17, 2024 · 19 comments

Labels: Driver Bug, goal:client support (Feature or fix needed for a current slang user.)

@Jaisiero

Jaisiero commented Dec 17, 2024

Hi there! I'm using shared memory and trying to increment it with an atomic operation in a compute shader, but it's not working.
I have simplified the code as much as possible. Could you guys take a look?

// Edit: I forgot to include these constants at first
static const uint RADIX_SORT_WORKGROUP_SIZE = 256; // assert WORKGROUP_SIZE >= RADIX_SORT_BINS
static const uint RADIX_SORT_BINS = 256;

// groupshared variables
groupshared uint _histogram[RADIX_SORT_BINS];

// Entry point with HLSL semantics
[numthreads(RADIX_SORT_WORKGROUP_SIZE, 1, 1)]
void entry_radix_sort_histogram(uint3 pixel_i : SV_DispatchThreadID,
                                uint3 GroupThreadID : SV_GroupThreadID,
                                uint3 GroupID : SV_GroupID)
{
  uint lID = GroupThreadID.x;
  uint wID = GroupID.x;

  Ptr<uint> global_histograms =
      Ptr<uint>(RBRSH.task_head.global_histograms);

  Ptr<SimConfig> sim_config =
      Ptr<SimConfig>(RBRSH.task_head.sim_config);

  uint shift = sim_config->radix_shift;

  uint num_of_elements = sim_config->rigid_body_count;

  // Initialize histogram
  if (lID < RADIX_SORT_BINS)
  {
    _histogram[lID] = 0U;
  }
  GroupMemoryBarrierWithGroupSync();

  // Add to histogram
  if (lID < RADIX_SORT_BINS)
  {
    uint old_value = 0;
    InterlockedAdd(_histogram[lID], 1, old_value);
    // ++_histogram[lID]; <-- this works to increase count but I obviously need it in an atomic manner
  }
  GroupMemoryBarrierWithGroupSync();

  if (lID < RADIX_SORT_BINS)
  {
    uint count = _histogram[lID];
    global_histograms[RADIX_SORT_BINS * wID + lID] = count;
    if(count > 0)
      printf("Histogram global bin %d, bin %d, count %d\n", RADIX_SORT_BINS * wID + lID, lID, count);
  }
}

Expected result: it should print at least once (certainly more than zero; the printf does fire when a regular increment is used instead).
Actual result: when I use atomics, the value read back is always zero (no printf is performed at all).

Thank you!

Jaime.

@csyonghe
Collaborator

I wonder if it is a driver issue.

I have this simple code:

groupshared uint m[4];

[numthreads(4,1,1)]
void computeMain(int i : SV_GroupThreadID)
{
    m[i] = 0;
    GroupMemoryBarrierWithGroupSync();
    InterlockedAdd(m[i], 1);
    GroupMemoryBarrierWithGroupSync();
    if (m[i] != 0)
        printf("mem");
}

which generates:

; SPIR-V
; Version: 1.5
; Generator: Khronos; 40
; Bound: 38
; Schema: 0
               OpCapability Shader
               OpExtension "SPV_KHR_non_semantic_info"
         %33 = OpExtInstImport "NonSemantic.DebugPrintf"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %computeMain "main" %m %gl_LocalInvocationID
               OpExecutionMode %computeMain LocalSize 4 1 1
               OpSource Slang 1
         %35 = OpString "mem"
               OpName %m "m"
               OpName %computeMain "computeMain"
               OpDecorate %gl_LocalInvocationID BuiltIn LocalInvocationId
               OpDecorate %_arr_uint_int_4 ArrayStride 4
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
        %int = OpTypeInt 32 1
      %int_4 = OpConstant %int 4
%_arr_uint_int_4 = OpTypeArray %uint %int_4
%_ptr_Workgroup__arr_uint_int_4 = OpTypePointer Workgroup %_arr_uint_int_4
%_ptr_Workgroup_uint = OpTypePointer Workgroup %uint
     %uint_0 = OpConstant %uint 0
     %uint_2 = OpConstant %uint 2
   %uint_264 = OpConstant %uint 264
     %uint_1 = OpConstant %uint 1
       %bool = OpTypeBool
%gl_LocalInvocationID = OpVariable %_ptr_Input_v3uint Input
          %m = OpVariable %_ptr_Workgroup__arr_uint_int_4 Workgroup
%computeMain = OpFunction %void None %3
          %4 = OpLabel
          %9 = OpLoad %v3uint %gl_LocalInvocationID
         %12 = OpCompositeExtract %uint %9 0
         %14 = OpBitcast %int %12
         %20 = OpAccessChain %_ptr_Workgroup_uint %m %14
               OpStore %20 %uint_0
               OpControlBarrier %uint_2 %uint_2 %uint_264
         %27 = OpAtomicIAdd %uint %20 %uint_1 %uint_0 %uint_1
               OpControlBarrier %uint_2 %uint_2 %uint_264
         %29 = OpLoad %uint %20
         %31 = OpINotEqual %bool %29 %uint_0
               OpSelectionMerge %6 None
               OpBranchConditional %31 %5 %6
          %5 = OpLabel
         %34 = OpExtInst %void %33 1 %35
               OpBranch %6
          %6 = OpLabel
               OpReturn
               OpFunctionEnd
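
For reference, here is the same kernel annotated with the instructions each key line lowers to; the scope and memory-semantics values are decoded per the SPIR-V spec:

groupshared uint m[4];

[numthreads(4,1,1)]
void computeMain(int i : SV_GroupThreadID)
{
    m[i] = 0;                           // OpStore to the Workgroup pointer %20
    GroupMemoryBarrierWithGroupSync();  // OpControlBarrier: execution scope Workgroup (2),
                                        //   memory scope Workgroup (2),
                                        //   semantics 264 = AcquireRelease | WorkgroupMemory
    InterlockedAdd(m[i], 1);            // OpAtomicIAdd on the same Workgroup pointer:
                                        //   scope Device (1), semantics Relaxed (0), operand 1
    GroupMemoryBarrierWithGroupSync();  // same barrier as above
    if (m[i] != 0)                      // OpLoad from the Workgroup pointer, compared against 0
        printf("mem");
}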

Looking at this SPIR-V, I am not able to identify anything suspicious. Can you update your driver and try again? Which platform/hardware are you using?

@csyonghe
Collaborator

I recommend using Slang's Atomic type for more explicit control over atomic loads and stores.

Can you try:

groupshared Atomic<uint> m[4];

[numthreads(4,1,1)]
void computeMain(int i : SV_GroupThreadID)
{
    m[i] = 0;
    GroupMemoryBarrierWithGroupSync();
    m[i] += 1;  // or equivalently: m[i].add(1);
    GroupMemoryBarrierWithGroupSync();
    if (m[i].load() != 0)
        printf("mem");
}

@Jaisiero
Author

Jaisiero commented Dec 18, 2024

@csyonghe Thank you very much for answering, and I am sorry I didn't provide a minimal reproduction of the bug at first. Your code works.

However, I think I have found the combination of factors that triggers it. Here it is:

// groupshared variables
groupshared uint _histogram[4];

// Entry point with HLSL semantics
[numthreads(4, 1, 1)]
void entry_radix_sort_histogram(int i : SV_DispatchThreadID)
{
  uint lID = i;

  // global histogram buffer
  Ptr<uint> global_histograms =
      Ptr<uint>(RBRSH.task_head.global_histograms);

  // Initialize histogram
  _histogram[lID] = 0U;
  GroupMemoryBarrierWithGroupSync();

  // Add to histogram
  InterlockedAdd(_histogram[lID], 1U);
  GroupMemoryBarrierWithGroupSync();

  uint count = _histogram[lID];
  global_histograms[lID] = count;    // <-- this is the problematic line
  if (count > 0)
    printf("Histogram global bin %d, count %d\n", lID, count);
}

It seems like writing into global memory is interfering with my shared memory. The global buffer is large enough to hold hundreds of 32-bit values, so I don't think I am writing out of bounds or anything like that.

By the way, this is the SPIR-V code:

[144]  %11 = OpFunction %3 None %12
Label 145
[Selection Header 145]
[145]  %13 = OpLabel
[146]  %17 = OpLoad %16 %19
[147]  %20 = OpCompositeExtract %6 %17 0
[148]  %22 = OpBitcast %21 %20
[149]  %23 = OpExtInst %3 %2 DebugLine %4 %24 %24 %25 %26
[150]  %27 = OpExtInst %3 %2 DebugLine %4 %24 %24 %25 %26
[151]  %28 = OpExtInst %3 %2 DebugLine %4 %24 %24 %25 %26
[152]  %29 = OpBitcast %6 %22
[153]  %30 = OpExtInst %3 %2 DebugLine %4 %31 %31 %25 %26
[154]  %45 = OpAccessChain %44 %42 %43
[155]  %48 = OpAccessChain %47 %45 %46
[156]  %49 = OpLoad %40 %48
[157]  %50 = OpExtInst %3 %2 DebugLine %4 %51 %51 %25 %26
[158]  %57 = OpAccessChain %56 %55 %29
[159]  OpStore %57 %58
[160]  %60 = OpExtInst %3 %2 DebugLine %4 %61 %61 %25 %26
[161]  OpControlBarrier %62 %62 %63
[162]  %65 = OpExtInst %3 %2 DebugLine %4 %66 %66 %25 %26
[163]  %67 = OpPtrAccessChain %40 %49 %29
[164]  %68 = OpAtomicIAdd %6 %57 %69 %58 %69
[165]  %70 = OpExtInst %3 %2 DebugLine %4 %71 %71 %25 %26
[166]  OpControlBarrier %62 %62 %63
[167]  %73 = OpExtInst %3 %2 DebugLine %4 %74 %74 %25 %26
[168]  %75 = OpLoad %6 %57
[169]  %76 = OpExtInst %3 %2 DebugLine %4 %77 %77 %25 %26
[170]  OpStore %67 %75 Aligned 4
[171]  %79 = OpExtInst %3 %2 DebugLine %4 %80 %80 %25 %26
[172]  %82 = OpUGreaterThan %81 %75 %58
[173]  OpSelectionMerge %15 None
[174]  OpBranchConditional %82 %14 %15
Label 175
[175]  %14 = OpLabel
[176]  %84 = OpExtInst %3 %2 DebugLine %4 %85 %85 %8 %86
[177]  %87 = OpExtInst %3 %88 DebugPrintf %89 %29 %75
[178]  OpBranch %15
Label 179
[Return 179] [Selection Merge 145]
[179]  %15 = OpLabel
[180]  OpReturn
[181]  OpFunctionEnd

[screenshot]

I'll attach the whole spirv output in case you need it:
Radix Sort Histogram.comp.entry_radix_sort_histogram.spv.txt

I hope this sheds some light on the issue.

Edit: I am on Vulkan Instance Version: 1.3.280, Win11, RTX4090, Nvidia driver v.566.36

Jaime.

@Jaisiero
Author

Jaisiero commented Dec 18, 2024

I updated the sample to match your second one. It still doesn't give me the right answer:

// groupshared variables
groupshared Atomic<uint> _histogram[4];

// Entry point with HLSL semantics
[numthreads(4, 1, 1)]
void entry_radix_sort_histogram(int i : SV_DispatchThreadID)
{
  // global histogram buffer
  Ptr<uint> global_histograms =
      Ptr<uint>(RBRSH.task_head.global_histograms);

  // Initialize histogram
  _histogram[i] = 0U;
  GroupMemoryBarrierWithGroupSync();

  // Add to histogram
  _histogram[i] += 1U;
  GroupMemoryBarrierWithGroupSync();

  uint count = _histogram[i].load();
  global_histograms[i] = count;
  if(count > 0)
    printf("Histogram global bin %d, count %d\n", i, count);
}

Edit: If I increment global memory directly in my thread:

{
 // all the code is the same except for this line
 global_histograms[i] += 1U;
}

my memory is updated correctly:
[screenshot]

So I don't understand why those writes to global memory cause bad behaviour in my shared memory.

Thanks for your attention.

Jaime.

@csyonghe
Collaborator

I feel you may be running into a driver bug.
Any chance you can test your shader on a GPU from a different vendor?

@csyonghe
Collaborator

I would also try changing the barrier to AllMemoryBarrierWithGroupSync() and see if it changes anything.
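
For example, applied to the earlier repro, only the barrier calls change (just a sketch):

groupshared Atomic<uint> m[4];

[numthreads(4,1,1)]
void computeMain(int i : SV_GroupThreadID)
{
    m[i] = 0;
    AllMemoryBarrierWithGroupSync();   // waits on all memory accesses (device + groupshared), not just groupshared
    m[i] += 1;
    AllMemoryBarrierWithGroupSync();
    if (m[i].load() != 0)
        printf("mem");
}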

@Jaisiero
Author

Jaisiero commented Dec 18, 2024

I would also try changing the barrier to AllMemoryBarrierWithGroupSync() and see if it changes anything.

I did use AllMemoryBarrierWithGroupSync as suggested, but nothing changed. I will try to find someone who can test the code on other hardware, because I have no other GPU available right now.

I'll keep you updated.

Jaime.

bmillsNV added this to the Q1 2025 (Winter) milestone Dec 19, 2024
bmillsNV added the goal:client support and Driver Bug labels Dec 19, 2024
@Jaisiero
Author

Hi again @csyonghe,
I hope you are doing well.

I put together a simple project for the Daxa API that reproduces this, using both GLSL and Slang:
shared mem sample

The GLSL version works, the Slang one doesn't.
I checked this using the Vulkan debug layer.

Best regards,
Jaime.

csyonghe self-assigned this Jan 16, 2025
@csyonghe
Collaborator

Let me verify this on my machine and get back to you.

@csyonghe
Collaborator

csyonghe commented Jan 16, 2025

I created this test:

//TEST:COMPARE_COMPUTE(filecheck-buffer=CHECK): -shaderobj -vk

groupshared Atomic<uint> values[10];

//TEST_INPUT: set outputAddr = out ubuffer(data=[0 0 0 0], stride=4)
uniform uint64_t outputAddr;

[numthreads(4,1,1)]
void computeMain(int i : SV_DispatchThreadID)
{
    uint* output = (uint*)(outputAddr);
    values[i] = 0;
    GroupMemoryBarrierWithGroupSync();
    values[i] += 1;
    GroupMemoryBarrierWithGroupSync();
    output[i] = values[i].load();

    // CHECK: 1
    // CHECK: 1
    // CHECK: 1
    // CHECK: 1
}

The result of outputBuffer after running the shader is:

1
1
1
1

which suggests that the code is working properly. There could be other factors in your application that are causing this issue. I inspected the SPIR-V and was unable to find anything wrong with it.

I am on RTX4090 with NVIDIA driver 561.09.

@csyonghe
Collaborator

csyonghe commented Jan 16, 2025

I see that you are using unoptimized code. Can you pass -O1 to slang and let spirv-opt process the SPIR-V first?

I've seen bugs within the driver when dealing with unoptimized SPIR-V.

Update: I tried running unoptimized SPIR-V for this test as well, and it still produces the correct result.

@Jaisiero
Author

Jaisiero commented Jan 16, 2025

I installed that driver version:
[screenshot]

The output of my buffer is:
0 0 0 0

That matches the fact that shared memory never gets updated.
So nothing has changed, and I am not sure what the source of this issue is.

Update: I did another test. If I change these lines:

uint count = values[i].load();
output[i] += count;

to these:

uint count = values[i].load();
output[i] += 1;

my buffer gets updated:

8359 8359 8359 8359

@csyonghe
Collaborator

So you ran the same test in PR #6107 and you see 0s?

@Jaisiero
Author

I have run your test and it works on my machine!

[screenshot]

Do you think that bindless descriptors could be messing with shared memory somehow?
#3532

I was looking into the compilation flags that Daxa uses:
std::array<char const *, 4> cmd_args = {
    // https://github.com/shader-slang/slang/issues/3532
    // Disables warning for aliasing bindings.
    // clang-format off
    "-warnings-disable", "39001",
    "-O0",
    "-g2",
    // clang-format on
};

@Ipotrick

Ipotrick commented Jan 17, 2025

I can now also replicate the issue Jaisiero showed.
This is my replication:

groupshared uint value;

struct Push
{
    uint* value;
}
[[vk::push_constant]] Push push;

[shader("compute")]
[numthreads(1, 1, 1)]
void entry_replication()
{
    value = 0;
    InterlockedAdd(value, 1);
    uint ptr_read_value = *push.value; // set to 0 by cpu.
    printf("value: %i\n", value + ptr_read_value);
}

It prints 0, but it should print 1.
The error only appears with pointer usage.
This is very similar to another Slang GitHub issue.
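
For comparison, a version without the pointer read prints 1 as expected (a minimal control sketch; the entry-point name is just illustrative):

groupshared uint value;

[shader("compute")]
[numthreads(1, 1, 1)]
void entry_no_pointer()
{
    value = 0;
    InterlockedAdd(value, 1);
    printf("value: %i\n", value); // prints 1 when no pointer is dereferenced
}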

Config:

Slang Version: 2024.11
pc platform: x86-64msvc windows11
GPU: RTX4080, Driver: 566.03, Type: DCH
Compile options (cpp api):
slang::TargetDesc::format: SlangCompileTarget::SLANG_SPIRV
slang::TargetDesc::profile: glsl_450+spirv_1_4
slang::TargetDesc::flags: SLANG_TARGET_FLAG_GENERATE_SPIRV_DIRECTLY
slang::TargetDesc::forceGLSLScalarBufferLayout = true
slang::SessionDesc::defaultMatrixLayoutMode: SLANG_MATRIX_LAYOUT_COLUMN_MAJOR
slangRequest->processCommandLineArguments({"-warnings-disable", "39001","-O0"});

I also tried barriers and removing all the compile flags, but nothing helped.

I suspect that the Vulkan validation layers could cause a difference in results.

@Ipotrick

Ipotrick commented Jan 17, 2025

Update: using the latest Slang I cannot reproduce the issue. It seems to work now.

@Jaisiero
Author

Confirmed: the newest Slang version solved the issue for me.

@csyonghe
Collaborator

@Jaisiero Glad to know that this issue is resolved! Closing this now, and please let us know if you run into other problems.

@Jaisiero
Author

Thank you for helping! Much appreciated.
