Skip to content

Commit 12bcc03

Browse files
authored
CUDA half RWTexture write support/doc improvements (shader-slang#1839)
* #include an absolute path didn't work - because paths were taken to always be relative. * Fix for writing to RWTexture with half types on CUDA. * CUDA half functionality doc updates.
1 parent a2725fd commit 12bcc03

5 files changed

+36
-9
lines changed

docs/cuda-target.md

+19
Original file line numberDiff line numberDiff line change
@@ -196,6 +196,25 @@ void setDownstreamCompilerPrelude(SlangPassThrough passThrough, const char* prel
196196

197197
The code that sets up the prelude for the test infrastucture and command line usage can be found in ```TestToolUtil::setSessionDefaultPrelude```. Essentially this determines what the absolute path is to `slang-cpp-prelude.h` is and then just makes the prelude `#include "the absolute path"`.
198198

199+
Half Support
200+
============
201+
202+
Slang supports the half/float16 types on CUDA. To do so NVRTC must have access to the `cuda_fp16.h` and `cuda_fp16.hpp` files that are typically distributed as part of the CUDA SDK. When Slang detects the use of half in source, it will define `SLANG_CUDA_ENABLE_HALF` when `slang-cuda-prelude.h` is included. This will in turn try to include `cuda_fp16.h` and enable extra functionality within the prelude for half support.
203+
204+
Slang tries several mechanisms to locate `cuda_fp16.h` when NVRTC is initiated. The first mechanism is to look in the include paths that are passed to Slang. If `cuda_fp16.h` can be found in one of these paths, no more searching will be performed.
205+
206+
If this fails, the path where NVRTC is located will be searched. In that path "include" and "CUDA/include" paths will be searched. This is probably most suitable for Windows based targets, where NVRTC dll is placed along with other binaries. The "CUDA/include" path is used to try and make clear in this scenario what the contained files are for.
207+
208+
If this fails Slang will look for the CUDA_PATH environmental variable, as is typically set during a CUDA SDK installation.
209+
210+
If this fails - the prelude include of `cuda_fp16.h` will most likely fail on NVRTC invocation.
211+
212+
CUDA has the `__half` and `__half2` types defined in `cuda_fp16.h`. The `__half2` can produce results just as quickly as doing the same operation on `__half` - in essence for some operations `__half2` is [SIMD](https://en.wikipedia.org/wiki/SIMD) like. The half implementation in Slang tries to take advantage of this optimization.
213+
214+
Since Slang supports up to 4 wide vectors Slang has to build on CUDAs half support. The types _`_half3` and `__half4` are implemented in `slang-cuda-prelude.h` for this reason. It is worth noting that `__half3` is made up of a `__half2` and a `__half`. As `__half2` is 4 byte aligned, this means `__half3` is actually 8 bytes, rather than 6 bytes that might be expected.
215+
216+
One area where this optimization isn't fully used is in comparisons - as in effect Slang treats all the vector/matrix half comparisons as if they are scalar. This could be perhaps be improved on in the future. Doing so would require using features that are not directly available in the CUDA headers.
217+
199218
Wave Intrinsics
200219
===============
201220

docs/target-compatibility.md

+3-1
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ Items with ^ means there is some discussion about support later in the document
1010

1111
| Feature | D3D11 | D3D12 | VK | CUDA | CPU
1212
|-----------------------------|--------------|--------------|------------|---------------|---------------
13-
| Half Type | No | Yes | Yes | No + | No +
13+
| Half Type | No | Yes | Yes | Yes ^ | No +
1414
| Double Type | Yes | Yes | Yes | Yes | Yes
1515
| Double Intrinsics | No | Limited + | Limited | Most | Yes
1616
| u/int64_t Type | No | Yes ^ | Yes | Yes | Yes
@@ -46,6 +46,8 @@ Items with ^ means there is some discussion about support later in the document
4646

4747
There appears to be a problem writing to a StructuredBuffer containing half on D3D12. D3D12 also appears to have problems doing calculations with half.
4848

49+
In order for half to work in CUDA, NVRTC must be able to include `cuda_fp16.h` and related files. Please read the [CUDA target documentation](cuda-target.md) for more details.
50+
4951
## u/int64_t Type
5052

5153
Requires SM6.0 which requires DXIL for D3D12. Therefore not available with DXBC on D3D11 or D3D12.

prelude/slang-cuda-prelude.h

+9-3
Original file line numberDiff line numberDiff line change
@@ -306,6 +306,15 @@ SLANG_FORCE_INLINE SLANG_CUDA_CALL ushort2 __half_as_ushort(const __half2& i) {
306306
SLANG_FORCE_INLINE SLANG_CUDA_CALL ushort3 __half_as_ushort(const __half3& i) { return make_ushort3(__half_as_ushort(i.xy.x), __half_as_ushort(i.xy.y), __half_as_ushort(i.z)); }
307307
SLANG_FORCE_INLINE SLANG_CUDA_CALL ushort4 __half_as_ushort(const __half4& i) { return make_ushort4(__half_as_ushort(i.xy.x), __half_as_ushort(i.xy.y), __half_as_ushort(i.zw.x), __half_as_ushort(i.zw.y)); }
308308

309+
// This is a little bit of a hack. Fortunately CUDA has the definitions of the templated types in
310+
// include/surface_indirect_functions.h
311+
// Here we find the template definition requires a specialization of __nv_isurf_trait to allow
312+
// a specialization of the surface write functions.
313+
// This *isn't* a problem on the read functions as they don't have a return type that uses this mechanism
314+
315+
template<> struct __nv_isurf_trait<__half> { typedef void type; };
316+
template<> struct __nv_isurf_trait<__half2> { typedef void type; };
317+
template<> struct __nv_isurf_trait<__half4> { typedef void type; };
309318

310319
#define SLANG_DROP_PARENS(...) __VA_ARGS__
311320

@@ -336,8 +345,6 @@ SLANG_SURFACE_READ(surf2DLayeredread, (int x, int y, int layer), (x, y, layer))
336345
SLANG_SURFACE_READ(surfCubemapread, (int x, int y, int face), (x, y, face))
337346
SLANG_SURFACE_READ(surfCubemapLayeredread, (int x, int y, int layerFace), (x, y, layerFace))
338347

339-
// The following doesn't quite work, for reasons currently not determined
340-
#if 0
341348
#define SLANG_SURFACE_WRITE(FUNC_NAME, TYPE_ARGS, ARGS) \
342349
template <> \
343350
SLANG_FORCE_INLINE SLANG_CUDA_CALL void FUNC_NAME<__half>(__half data, cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode) \
@@ -364,7 +371,6 @@ SLANG_SURFACE_WRITE(surf1DLayeredwrite, (int x, int layer), (x, layer))
364371
SLANG_SURFACE_WRITE(surf2DLayeredwrite, (int x, int y, int layer), (x, y, layer))
365372
SLANG_SURFACE_WRITE(surfCubemapwrite, (int x, int y, int face), (x, y, face))
366373
SLANG_SURFACE_WRITE(surfCubemapLayeredwrite, (int x, int y, int layerFace), (x, y, layerFace))
367-
#endif
368374

369375
#endif
370376

tests/compute/half-rw-texture-simple.slang

+2-2
Original file line numberDiff line numberDiff line change
@@ -41,10 +41,10 @@ void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
4141
// NOTE! This is disabled because on CUDA, whilst this has an effect it is not what is expected.
4242
// The value read back has changed but seems to always be 1.
4343
// rwt1D[idx] = idx;
44-
//rwt2D[uint2(idx, idx)] = half(idx);
44+
rwt2D[uint2(idx, idx)] = half(idx);
4545

4646
//val += rwt1D[idx];
47-
//val += rwt2D[uint2(idx, idx)];
47+
val += rwt2D[uint2(idx, idx)];
4848
//val += rwt3D[uint3(idx, idx, idx)];
4949

5050
outputBuffer[idx] = val;
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
type: float
22
5.000000
3-
5.000000
4-
5.000000
5-
5.000000
3+
6.000000
4+
7.000000
5+
8.000000

0 commit comments

Comments
 (0)