Commit 2fffbc5

Authored by jsmall-zzz and Tim Foley
Int64 atomic add RWByteAddressBuffer support (shader-slang#1504)
* Fix premake5.lua so it uses the new path needed for OpenCLDebugInfo100.h
* Keep including the includes directory.
* Added the spirv-tools-generated files.
* We don't need to include the spirv/unified1 path because the files needed are actually in the spirv-tools-generated folder.
* Put the build_info.h glslang generated files in external/glslang-generated. Alter premake5.lua to pick up that header.
* First pass at documenting how to build glslang and spirv-tools.
* Improved glsl/spir-v tools README.md
* Added revision.h
* Change how gResources is calculated. Update about revision.h
* Update docs a little.
* Split out spirv-tools into a separate project for building glslang. This was not necessary on linux, but *is* necessary on windows, because there is a file disassemble.cpp in both spirv-tools and glslang, and this leads to VS choosing only one. With the separate library, the problem is resolved.
* Fix direct-spirv-emit output.
* Update to latest version of spirv headers and spirv-tools.
* Upgrade submodule version of glslang in external.
* Add fPIC to build options of slang-spirv-tools
* WIP adding support for InterlockedAddFp32
* Upgrade slang-binaries to have new glslang.
* Fix issues with Windows slang-glslang binaries, via update of slang-binaries used.
* WIP - atomicAdd. This solution can't work as we can't do (float*) in glsl.
* WIP on atomic float ops.
* Added checking for multiple decls that takes into account __target_intrinsic and __specialized_for_target. First pass impl of atomic add on float for glsl.
* Split __atomicAdd so extensions are applied appropriately.
* Made Dxc/Fxc support includes. Use HLSL prelude to pass the path to nvapi. Added -nv-api-path
* Refactor around IncludeHandler and impl of IncludeSystem
* slang-include-handler -> slang-include-system. Have IncludeHandler/Impl defined in slang-preprocessor
* Small comment improvements.
* Document atomic float add addition in target-compatibility.md.
* CUDA float atomic support on RWByteAddressBuffer.
* Add atomic-float-byte-address-buffer-cross.slang
* Removed inappropriate-once.slang - the test is no longer valid when a file is loaded and has a unique identity by default. A test could be made, but it would require an API call to create the file (so no unique id). Improved handling of loadFile - uses uniqueId if it has one.
* Work around for testing target overlaps - to avoid exceptions on adding targets. Simplify PathInfo setup. Modify single-target-intrinsic.slang - it no longer failed because there were no longer multiple definitions for the same target.
* Int64 atomic add RWByteAddressBuffer support.
* Fix typo in stdlib for int atomic ByteAddressBuffer.
* Small fixes to int64 atomic test.

Co-authored-by: Tim Foley <tfoleyNV@users.noreply.github.com>
1 parent: b820f34

5 files changed: +99 −5 lines

docs/target-compatibility.md (+5 −3)

@@ -187,13 +187,15 @@ Currently feature allows atomic float additions on RWByteAddressBuffer. A future
 ```
 void RWByteAddressBuffer::InterlockedAddFp32(uint byteAddress, float valueToAdd, out float originalValue);
 void RWByteAddressBuffer::InterlockedAddFp32(uint byteAddress, float valueToAdd);
+void RWByteAddressBuffer::InterlockedAddI64(uint byteAddress, int64_t valueToAdd, out int64_t originalValue);
+void RWByteAddressBuffer::InterlockedAddI64(uint byteAddress, int64_t valueToAdd);
 ```
 
 On HLSL based targets this functionality is achieved using [nvAPI](https://developer.nvidia.com/nvapi). Therefore, for the feature to work, you must have nvAPI installed on your system. The 'prelude' functionality then allows, via the API, an include (or the text) of the relevant files. To see how to do this in practice, look at the function `setSessionDefaultPrelude`. It makes the HLSL prelude hold an include of the *absolute* path to the required header `nvHLSLExtns.h`. Because an absolute path is used, the headers that it in turn includes are found in the correct place without having to set up special include paths.
 
 To use nvAPI it is necessary to specify an unordered access view (UAV) based 'u' register that will be used to communicate with nvAPI. Note! Slang does not do any special handling around this: application code must ensure that the UAV is either guaranteed not to collide with what Slang assigns, or is specified (but not used) in the Slang source. The u register number also has to be passed to the nvAPI runtime library.
 
-On Vulkan, the [`GL_EXT_shader_atomic_float`](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_shader_atomic_float.html) extension is required.
+On Vulkan, for float the [`GL_EXT_shader_atomic_float`](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_EXT_shader_atomic_float.html) extension is required. For int64 the [`GL_EXT_shader_atomic_int64`](https://raw.githubusercontent.com/KhronosGroup/GLSL/master/extensions/ext/GL_EXT_shader_atomic_int64.txt) extension is required.
 
+CUDA requires SM6.0 or higher for int64 support.
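The I64 variants address the buffer in bytes, so consecutive int64_t slots sit 8 bytes apart, and the glsl specialization maps a byte address to a structured-buffer index with `byteAddress / 8`. A minimal Python sketch of that addressing (illustrative only; it models the little-endian 32-bit word layout the tests use, not any Slang API):

```python
import struct

# Eight 32-bit words, as in a TEST_INPUT ubuffer declaration.
words = [0, 1, 2, 3, 4, 5, 6, 7]
raw = struct.pack("<8I", *words)

def read_i64(buf, byte_address):
    """Read the int64_t slot at a byte address: slot index is byte_address // 8,
    and each slot is two consecutive little-endian 32-bit words (lo, hi)."""
    return struct.unpack_from("<q", buf, byte_address)[0]

assert read_i64(raw, 0)  == (1 << 32) | 0   # words 0 (lo) and 1 (hi)
assert read_i64(raw, 8)  == (3 << 32) | 2   # words 2 and 3
assert read_i64(raw, 24) == (7 << 32) | 6   # words 6 and 7
```

This is why the new stdlib code divides by 8 for the int64 structured-buffer path, where the existing float path divides by 4.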

source/slang/hlsl.meta.slang (+57 −1)

@@ -56,9 +56,12 @@ struct ByteAddressBuffer
 __target_intrinsic(glsl, "atomicAdd($0, $1)")
 __glsl_version(430)
 __glsl_extension(GL_EXT_shader_atomic_float)
-//__glsl_extension(GL_EXT_gpu_shader5)
 float __atomicAdd(__ref float value, float amount);
 
+// Helper for hlsl, using nvAPI
+__target_intrinsic(hlsl, "NvInterlockedAddUint64($0, $1, $2)")
+uint2 __atomicAdd(RWByteAddressBuffer buf, uint offset, uint2);
+
 // Int versions require glsl 4.30
 // https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/atomicAdd.xhtml

@@ -70,6 +73,10 @@ __target_intrinsic(glsl, "atomicAdd($0, $1)")
 __glsl_version(430)
 uint __atomicAdd(__ref uint value, uint amount);
 
+__target_intrinsic(glsl, "atomicAdd($0, $1)")
+__glsl_version(430)
+__glsl_extension(GL_EXT_shader_atomic_int64)
+int64_t __atomicAdd(__ref int64_t value, int64_t amount);
 
 __intrinsic_op($(kIROp_ByteAddressBufferLoad))
 T __byteAddressBufferLoad<T>(ByteAddressBuffer buffer, int offset);

@@ -192,6 +199,9 @@ ${{{{
 // NvAPI support on DX
 // NOTE! To use this feature on HLSL, the shader needs to include 'nvHLSLExtns.h' from the NvAPI SDK
 //
+
+// Fp32
+
 __target_intrinsic(hlsl, "($3 = NvInterlockedAddFp32($0, $1, $2))")
 __target_intrinsic(cuda, "(*$3 = atomicAdd((float*)$0._getPtrAt($1), $2))")
 void InterlockedAddFp32(uint byteAddress, float valueToAdd, out float originalValue);

@@ -203,6 +213,8 @@ ${{{{
     originalValue = __atomicAdd(buf[byteAddress / 4], valueToAdd);
 }
 
+// Without returning original value
+
 __target_intrinsic(hlsl, "(NvInterlockedAddFp32($0, $1, $2))")
 __target_intrinsic(cuda, "atomicAdd((float*)$0._getPtrAt($1), $2)")
 void InterlockedAddFp32(uint byteAddress, float valueToAdd);

@@ -214,6 +226,50 @@ ${{{{
     __atomicAdd(buf[byteAddress / 4], valueToAdd);
 }
 
+// Int64
+__cuda_sm_version(6.0)
+__target_intrinsic(cuda, "(*$3 = atomicAdd((uint64_t*)$0._getPtrAt($1), $2))")
+void InterlockedAddI64(uint byteAddress, int64_t valueToAdd, out int64_t originalValue);
+
+__specialized_for_target(hlsl)
+void InterlockedAddI64(uint byteAddress, int64_t inValueToAdd, out int64_t outOriginalValue)
+{
+    uint2 valueToAdd;
+    valueToAdd.x = uint(inValueToAdd);
+    valueToAdd.y = uint(uint64_t(inValueToAdd) >> 32);
+
+    const uint2 originalValue = __atomicAdd(this, byteAddress, valueToAdd);
+    outOriginalValue = (int64_t(originalValue.y) << 32) | originalValue.x;
+}
+
+__specialized_for_target(glsl)
+void InterlockedAddI64(uint byteAddress, int64_t valueToAdd, out int64_t originalValue)
+{
+    RWStructuredBuffer<int64_t> buf = __getEquivalentStructuredBuffer<int64_t>(this);
+    originalValue = __atomicAdd(buf[byteAddress / 8], valueToAdd);
+}
+
+// Without returning original value
+__cuda_sm_version(6.0)
+__target_intrinsic(cuda, "atomicAdd((uint64_t*)$0._getPtrAt($1), $2)")
+void InterlockedAddI64(uint byteAddress, int64_t valueToAdd);
+
+__specialized_for_target(hlsl)
+void InterlockedAddI64(uint byteAddress, int64_t inValueToAdd)
+{
+    uint2 valueToAdd;
+    valueToAdd.x = uint(inValueToAdd);
+    valueToAdd.y = uint(uint64_t(inValueToAdd) >> 32);
+    __atomicAdd(this, byteAddress, valueToAdd);
+}
+
+__specialized_for_target(glsl)
+void InterlockedAddI64(uint byteAddress, int64_t valueToAdd)
+{
+    RWStructuredBuffer<int64_t> buf = __getEquivalentStructuredBuffer<int64_t>(this);
+    __atomicAdd(buf[byteAddress / 8], valueToAdd);
+}
+
 ${{{{
 }
 }}}}
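The HLSL specialization above emulates the 64-bit add on top of nvAPI's `NvInterlockedAddUint64`, which works on a `uint2`: the signed 64-bit operand is split into low/high 32-bit halves, and the returned original value is recombined with a shift and OR. A minimal Python sketch of that split/recombine arithmetic (illustrative only; two's-complement wraparound at 64 bits is modeled with masking):

```python
def split_int64(v):
    """Split a signed 64-bit value into (lo, hi) 32-bit halves, mirroring
    valueToAdd.x = uint(v); valueToAdd.y = uint(uint64_t(v) >> 32)."""
    u = v & 0xFFFFFFFFFFFFFFFF          # reinterpret as uint64_t
    return u & 0xFFFFFFFF, u >> 32

def combine_uint2(lo, hi):
    """Recombine, mirroring (int64_t(hi) << 32) | lo, then sign-extend
    back to a signed 64-bit value."""
    u = ((hi << 32) | lo) & 0xFFFFFFFFFFFFFFFF
    return u - (1 << 64) if u >= (1 << 63) else u

# Round-trip check, including a negative value.
for v in (0, 1, -1, -5, 1 << 32, (1 << 62) - 3):
    assert combine_uint2(*split_int64(v)) == v
```

This round-trip is what makes the two overloads agree: the add happens on the raw 64-bit bit pattern, so signed addition falls out of unsigned addition for free.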

tests/slang-extension/atomic-float-byte-address-buffer.slang (+1 −1)

@@ -7,7 +7,7 @@
 // Disabled because requires nvapi to work
 // Note for this feature we require dxc and we can force that with -use-dxil
 //DISABLE_TEST(compute):COMPARE_COMPUTE_EX:-d3d12 -compute -use-dxil
-//TEST(compute):COMPARE_COMPUTE_EX:-cuda -compute -use-dxil
+//TEST(compute):COMPARE_COMPUTE_EX:-cuda -compute
 
 //TEST_INPUT:ubuffer(data=[0.1 0.2 0.3 0.4]):out,name=outputBuffer
 RWByteAddressBuffer outputBuffer;
New test file (+28)

@@ -0,0 +1,28 @@
+// No atomic support on CPU
+//DISABLE_TEST(compute):COMPARE_COMPUTE_EX:-cpu -compute
+// No support for int64_t on DX11
+//DISABLE_TEST(compute):COMPARE_COMPUTE_EX:-slang -compute
+// No support for int64_t on fxc - we need SM6.0 and dxil
+// https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/hlsl-shader-model-6-0-features-for-direct3d-12
+//DISABLE_TEST(compute):COMPARE_COMPUTE_EX:-slang -compute -dx12
+// Disable for now, because can only test when NVAPI is available, and it is not by default.
+//DISABLE_TEST(compute):COMPARE_COMPUTE_EX:-slang -compute -dx12 -profile cs_6_0 -use-dxil
+//TEST(compute, vulkan):COMPARE_COMPUTE_EX:-vk -compute
+//TEST(compute):COMPARE_COMPUTE_EX:-cuda -compute
+
+//TEST_INPUT:ubuffer(data=[0 1 2 3 4 5 6 7]):out,name=outputBuffer
+RWByteAddressBuffer outputBuffer;
+
+[numthreads(16, 1, 1)]
+void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
+{
+    uint tid = dispatchThreadID.x;
+    int idx = (tid & 3) ^ (tid >> 2);
+
+    int64_t previousValue = 0;
+    outputBuffer.InterlockedAddI64((idx << 3), 1, previousValue);
+
+    int anotherIdx = tid >> 2;
+    outputBuffer.InterlockedAddI64(anotherIdx << 3, 3);
+}
New expected-output file (+8)

@@ -0,0 +1,8 @@
+10
+1
+12
+3
+14
+5
+16
+7
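The expected values can be derived by simulating the kernel: over tid 0..15, `(tid & 3) ^ (tid >> 2)` covers each slot 0..3 exactly four times, as does `tid >> 2`, so every int64 slot receives four +1 adds and four +3 adds, i.e. +16 total. A hedged Python simulation of this (assuming, as the values above suggest, that the expected-output file lists the buffer's 32-bit words in hex):

```python
import struct

# Initial buffer: eight 32-bit words -> four little-endian int64 slots.
words = [0, 1, 2, 3, 4, 5, 6, 7]
slots = list(struct.unpack("<4q", struct.pack("<8I", *words)))

# Simulate the 16 threads of computeMain (order doesn't matter for adds).
for tid in range(16):
    idx = (tid & 3) ^ (tid >> 2)
    slots[idx] += 1               # InterlockedAddI64(idx << 3, 1, previousValue)
    another_idx = tid >> 2
    slots[another_idx] += 3       # InterlockedAddI64(another_idx << 3, 3)

# Dump back as hex 32-bit words, matching the expected-output file.
out = struct.unpack("<8I", struct.pack("<4q", *slots))
print([format(w, "x") for w in out])
# -> ['10', '1', '12', '3', '14', '5', '16', '7']
```

Each slot starts at `lo + (hi << 32)` for consecutive word pairs, so adding 16 bumps only the low word: e.g. slot 1 goes from lo=2 to lo=18 (0x12), with the high word 3 unchanged.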
