CUDA half RWTexture write support/doc improvements (shader-slang#1839)

jsmall-zzz · web-flow · commit 12bcc039c2a2 · 2021-05-14T16:59:35.000-04:00
* #include an absolute path didn't work - because paths were taken to always be relative.

* Fix for writing to RWTexture with half types on CUDA.

* CUDA half functionality doc updates.
diff --git a/docs/cuda-target.md b/docs/cuda-target.md
@@ -196,6 +196,25 @@ void setDownstreamCompilerPrelude(SlangPassThrough passThrough, const char* prel
 
 The code that sets up the prelude for the test infrastucture and command line usage can be found in ```TestToolUtil::setSessionDefaultPrelude```. Essentially this determines what the absolute path is to `slang-cpp-prelude.h` is and then just makes the prelude `#include "the absolute path"`.
 
+Half Support
+============
+
+Slang supports the half/float16 types on CUDA. To do so NVRTC must have access to the `cuda_fp16.h` and `cuda_fp16.hpp` files that are typically distributed as part of the CUDA SDK. When Slang detects the use of half in source, it will define `SLANG_CUDA_ENABLE_HALF` when `slang-cuda-prelude.h` is included. This will in turn try to include `cuda_fp16.h` and enable extra functionality within the prelude for half support. 
+
+Slang tries several mechanisms to locate `cuda_fp16.h` when NVRTC is initiated. The first mechanism is to look in the include paths that are passed to Slang. If `cuda_fp16.h` can be found in one of these paths, no more searching will be performed. 
+
+If this fails, the path where NVRTC is located will be searched. In that path "include" and "CUDA/include" paths will be searched. This is probably most suitable for Windows based targets, where NVRTC dll is placed along with other binaries. The "CUDA/include" path is used to try and make clear in this scenario what the contained files are for. 
+
+If this fails Slang will look for the CUDA_PATH environmental variable, as is typically set during a CUDA SDK installation. 
+
+If this fails - the prelude include of `cuda_fp16.h` will most likely fail on NVRTC invocation.
+
+CUDA has the `__half` and `__half2` types defined in `cuda_fp16.h`. The `__half2` can produce results just as quickly as doing the same operation on `__half` - in essence for some operations `__half2` is [SIMD](https://en.wikipedia.org/wiki/SIMD) like. The half implementation in Slang tries to take advantage of this optimization. 
+
+Since Slang supports up to 4 wide vectors Slang has to build on CUDAs half support. The types _`_half3` and `__half4` are implemented in `slang-cuda-prelude.h` for this reason. It is worth noting that `__half3` is made up of a `__half2` and a `__half`. As `__half2` is 4 byte aligned, this means `__half3` is actually 8 bytes, rather than 6 bytes that might be expected.
+
+One area where this optimization isn't fully used is in comparisons - as in effect Slang treats all the vector/matrix half comparisons as if they are scalar. This could be perhaps be improved on in the future. Doing so would require using features that are not directly available in the CUDA headers. 
+
 Wave Intrinsics
 ===============
 
diff --git a/docs/target-compatibility.md b/docs/target-compatibility.md
@@ -10,7 +10,7 @@ Items with ^ means there is some discussion about support later in the document
 
 | Feature                     |    D3D11     |    D3D12     |     VK     |      CUDA     |    CPU
 |-----------------------------|--------------|--------------|------------|---------------|---------------
-| Half Type                   |     No       |     Yes      |   Yes      |     No +      |    No +
+| Half Type                   |     No       |     Yes      |   Yes      |     Yes ^     |    No +
 | Double Type                 |     Yes      |     Yes      |   Yes      |     Yes       |    Yes
 | Double Intrinsics           |     No       |   Limited +  |  Limited   |     Most      |    Yes
 | u/int64_t Type              |     No       |   Yes ^      |   Yes      |     Yes       |    Yes
@@ -46,6 +46,8 @@ Items with ^ means there is some discussion about support later in the document
 
 There appears to be a problem writing to a StructuredBuffer containing half on D3D12. D3D12 also appears to have problems doing calculations with half.
 
+In order for half to work in CUDA, NVRTC must be able to include `cuda_fp16.h` and related files. Please read the [CUDA target documentation](cuda-target.md) for more details. 
+
 ## u/int64_t Type
 
 Requires SM6.0 which requires DXIL for D3D12. Therefore not available with DXBC on D3D11 or D3D12.
diff --git a/prelude/slang-cuda-prelude.h b/prelude/slang-cuda-prelude.h
@@ -306,6 +306,15 @@ SLANG_FORCE_INLINE SLANG_CUDA_CALL ushort2 __half_as_ushort(const __half2& i) {
 SLANG_FORCE_INLINE SLANG_CUDA_CALL ushort3 __half_as_ushort(const __half3& i) { return make_ushort3(__half_as_ushort(i.xy.x), __half_as_ushort(i.xy.y), __half_as_ushort(i.z)); }
 SLANG_FORCE_INLINE SLANG_CUDA_CALL ushort4 __half_as_ushort(const __half4& i) { return make_ushort4(__half_as_ushort(i.xy.x), __half_as_ushort(i.xy.y), __half_as_ushort(i.zw.x), __half_as_ushort(i.zw.y)); }
 
+// This is a little bit of a hack. Fortunately CUDA has the definitions of the templated types in 
+// include/surface_indirect_functions.h
+// Here we find the template definition requires a specialization of __nv_isurf_trait to allow 
+// a specialization of the surface write functions. 
+// This *isn't* a problem on the read functions as they don't have a return type that uses this mechanism 
+
+template<> struct __nv_isurf_trait<__half> { typedef void type; };
+template<> struct __nv_isurf_trait<__half2> { typedef void type; };
+template<> struct __nv_isurf_trait<__half4> { typedef void type; };
 
 #define SLANG_DROP_PARENS(...) __VA_ARGS__
 
@@ -336,8 +345,6 @@ SLANG_SURFACE_READ(surf2DLayeredread, (int x, int y, int layer), (x, y, layer))
 SLANG_SURFACE_READ(surfCubemapread, (int x, int y, int face), (x, y, face))
 SLANG_SURFACE_READ(surfCubemapLayeredread, (int x, int y, int layerFace), (x, y, layerFace))
 
-// The following doesn't quite work, for reasons currently not determined
-#if 0
 #define SLANG_SURFACE_WRITE(FUNC_NAME, TYPE_ARGS, ARGS) \
 template <> \
 SLANG_FORCE_INLINE SLANG_CUDA_CALL void FUNC_NAME<__half>(__half data, cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode) \
@@ -364,7 +371,6 @@ SLANG_SURFACE_WRITE(surf1DLayeredwrite, (int x, int layer), (x, layer))
 SLANG_SURFACE_WRITE(surf2DLayeredwrite, (int x, int y, int layer), (x, y, layer))
 SLANG_SURFACE_WRITE(surfCubemapwrite, (int x, int y, int face), (x, y, face))
 SLANG_SURFACE_WRITE(surfCubemapLayeredwrite, (int x, int y, int layerFace), (x, y, layerFace))
-#endif     
      
 #endif
 
diff --git a/tests/compute/half-rw-texture-simple.slang b/tests/compute/half-rw-texture-simple.slang
@@ -41,10 +41,10 @@ void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
     // NOTE! This is disabled because on CUDA, whilst this has an effect it is not what is expected.
     // The value read back has changed but seems to always be 1.
     // rwt1D[idx] = idx;
-    //rwt2D[uint2(idx, idx)] = half(idx);    
+    rwt2D[uint2(idx, idx)] = half(idx);    
     
     //val += rwt1D[idx];
-    //val += rwt2D[uint2(idx, idx)];
+    val += rwt2D[uint2(idx, idx)];
     //val += rwt3D[uint3(idx, idx, idx)];
  
     outputBuffer[idx] = val;
diff --git a/tests/compute/half-rw-texture-simple.slang.expected.txt b/tests/compute/half-rw-texture-simple.slang.expected.txt
@@ -1,5 +1,5 @@
 type: float
 5.000000
-5.000000
-5.000000
-5.000000
+6.000000
+7.000000
+8.000000