WIP on RWTexture types on CUDA/CPU (shader-slang#1234)

jsmall-zzz · web-flow · commit 1f401d04e32c · 2020-02-20T15:24:00.000-08:00
* CUDA support for array of resources.

* Add support for Texture2DArray on CPU
* Expand texture-simple.slang to test Texture2DArray

* Reorganise CUDAComputeUtil to split out createTextureResource.

* Add TextureCubeArray support for CPU/CUDA targets.

* Pulled out CUDAResource
Renamed derived classes to reflect that change.

* Creation of SurfObject type.

* Functions to return read/write access for simplifying future additions.

* WIP for RWTexture access on CPU/CUDA.

* CUsurfObject cannot have mips.

* Ability to set number of mips on test data.
Preliminary support for CUsurfObj and RWTexture1D on CUDA.
CUDA docs improvements.

* Fix typo.
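
For reviewers, the RWTexture `Load` intrinsic strings that the new CUDA block in core.meta.slang emits can be sketched as a standalone function. This is only an illustration: `makeCudaSurfaceRead` is a hypothetical helper that is not part of this change, and it assumes the cube shape uses three coordinates, as the cube branch in the diff implies.

```cpp
#include <string>

// Hypothetical helper (not in the Slang source): rebuilds the CUDA surface
// read intrinsic string the same way the generator code in the diff does.
std::string makeCudaSurfaceRead(int coordCount, bool isArray, bool isCube)
{
    std::string s = "surf";
    if (isCube)
    {
        // Cube maps always read x, y, z; the layer (if any) comes from w.
        s += "Cubemap";
        if (isArray) s += "Layered";
        s += "read<$T0>($0, ($1).x, ($1).y, ($1).z";
        if (isArray) s += ", int(($1).w)";
    }
    else
    {
        // The location argument is a scalar only when there is a single
        // component, so swizzles are emitted only for vector locations.
        const int vecCount = coordCount + (isArray ? 1 : 0);
        s += std::to_string(coordCount) + "D";
        if (isArray) s += "Layered";
        s += "read<$T0>($0";
        for (int i = 0; i < coordCount; ++i)
        {
            s += ", ($1)";
            if (vecCount > 1)
            {
                s += '.';
                s += char('x' + i);
            }
        }
        if (isArray)
        {
            // The layer index is the component after the coordinates.
            s += ", int(($1).";
            s += char('x' + coordCount);
            s += ")";
        }
    }
    s += ", SLANG_CUDA_BOUNDARY_MODE)";
    return s;
}
```

Under these assumptions, a non-array 1D surface yields `surf1Dread<$T0>($0, ($1), SLANG_CUDA_BOUNDARY_MODE)`, which is the form exercised by rw-texture-simple.slang.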
diff --git a/docs/cuda-target.md b/docs/cuda-target.md
@@ -16,13 +16,14 @@ These limitations apply to Slang transpiling to CUDA.
 
 * Only supports the 'texture object' style binding (The texture object API is only supported on devices of compute capability 3.0 or higher. )
 * Samplers are not separate objects in CUDA - they are combined into a single 'TextureObject'. So samplers are effectively ignored on CUDA targets. 
-* Whilst there is tex1Dfetch there are no equivalents for higher dimensions - so such accesses are not currently supported
 * When using a TextureArray (layered texture in CUDA) - the index will be treated as an int, as this is all CUDA allows
 * Care must be used in using `WaveGetLaneIndex` wave intrinsic - it will only give the right results for appopriate launches
+* Surfaces are used for textures which are read/write. CUDA does NOT do format conversion with surfaces.
 
-The following are a work in progress or not implmented but are planned to be so in the future
+The following are a work in progress or not implemented but are planned to be so in the future
 
-* Resource types including surfaces
+* Some resource types remain unsupported, and not all methods on types are supported
+* Some support for Wave intrinsics
 
 # How it works
 
@@ -96,6 +97,30 @@ The UniformState and UniformEntryPointParams struct typically vary by shader. Un
     size_t sizeInBytes;
 ```  
 
+## Texture
+
+Read-only textures are bound as the opaque CUDA type CUtexObject. This type is the combination of both a texture AND a sampler. This is somewhat different from HLSL, where there can be separate `SamplerState` variables, which allows a single texture binding to be accessed with different types of sampling. 
+
+If code relies on this behavior it will be necessary to bind multiple CUtexObjects with different sampler settings, accessing the same texture data. 
+
+Slang has some preliminary support for a TextureSampler type - a combined Texture and SamplerState. Writing Slang code that targets CUDA and other platforms using this mechanism exposes the semantics appropriately within the source.  
+ 
+Load is only supported for Texture1D, and the mip map selection argument is ignored. This is because CUDA provides tex1Dfetch but no higher dimensional equivalents. CUDA also only allows such access if the backing array is linear memory - meaning the bound texture cannot have mip maps - thus making the mip map parameter superfluous anyway. RWTexture does allow Load on other texture types.  
+ 
+## RWTexture 
+ 
+RWTexture types are converted into CUsurfObject type. 
+
+In CUDA it is not possible to do a format conversion on an access to a CUsurfObject, so it must be backed by the same data format as is used within the Slang source code. 
+
+It is also worth noting that CUsurfObjects in CUDA are NOT allowed to have mip maps. 
+
+By default surface access uses cudaBoundaryModeZero; this can be overridden using the macro SLANG_CUDA_BOUNDARY_MODE in the CUDA prelude.
+
+## Sampler
+
+Samplers are in effect ignored in CUDA output. Currently we do output a variable `SamplerState`, but this value is never accessed within the kernel and so can be ignored. More discussion on this behavior is in the `Texture` section.
+
 ## Unsized arrays
 
 Unsized arrays can be used, which are indicated by an array with no size as in `[]`. For example 
diff --git a/prelude/slang-cpp-types.h b/prelude/slang-cpp-types.h
@@ -343,6 +343,21 @@ struct TextureCubeArray
     ITextureCubeArray* texture;              
 };
 
+/* !!!!!!!!!!!!!!!!!!!!!!!!!!! RWTexture !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! */
+
+struct IRWTexture1D
+{
+    virtual void Load(int32_t loc, void* out) = 0;
+};
+
+template <typename T>
+struct RWTexture1D
+{
+    T Load(int32_t loc) const { T out; texture->Load(loc, &out); return out; }
+    
+    IRWTexture1D* texture;              
+};
+
 /* Varying input for Compute */
 
 /* Used when running a single thread */
diff --git a/prelude/slang-cuda-prelude.h b/prelude/slang-cuda-prelude.h
@@ -38,6 +38,16 @@
 // Here we don't have the index zeroing behavior, as such bounds checks are generally not on GPU targets either. 
 #ifndef SLANG_CUDA_FIXED_ARRAY_BOUND_CHECK
 #   define SLANG_CUDA_FIXED_ARRAY_BOUND_CHECK(index, count) SLANG_PRELUDE_ASSERT(index < count); 
+#endif
+
+ // This macro controls how out-of-range surface coordinates are handled; 
+ // It can equal
+ // cudaBoundaryModeClamp, in which case out-of-range coordinates are clamped to the valid range
+ // cudaBoundaryModeZero, in which case out-of-range reads return zero and out-of-range writes are ignored
+ // cudaBoundaryModeTrap, in which case out-of-range accesses cause the kernel execution to fail. 
+ 
+#ifndef SLANG_CUDA_BOUNDARY_MODE
+#   define SLANG_CUDA_BOUNDARY_MODE cudaBoundaryModeZero
 #endif
 
 template <typename T, size_t SIZE>
diff --git a/source/slang/core.meta.slang b/source/slang/core.meta.slang
@@ -777,6 +777,67 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
                     sb << ")$z\")\n";
 
                 }
+
+                // CUDA
+                if (isMultisample)
+                {
+                }
+                else
+                {
+                    if (access == SLANG_RESOURCE_ACCESS_READ_WRITE)
+                    {
+                        const int coordCount = kBaseTextureTypes[tt].coordCount;
+                        const int vecCount = coordCount + int(isArray);
+
+                        if( baseShape != TextureFlavor::Shape::ShapeCube )
+                        {
+                            sb << "__target_intrinsic(cuda, \"surf" << coordCount << "D";
+                            if (isArray)
+                            {
+                                sb << "Layered";
+                            }
+                            sb << "read";
+                            sb << "<$T0>($0";
+                            for (int i = 0; i < coordCount; ++i)
+                            {
+                                sb << ", ($1)";
+                                if (vecCount > 1)
+                                {
+                                    sb << '.' << char(i + 'x');
+                                }
+                            }
+                            if (isArray)
+                            {
+                                sb << ", int(($1)." << char(coordCount + 'x') << ")";
+                            }
+                            sb << ", SLANG_CUDA_BOUNDARY_MODE)\")\n";
+                        }
+                        else
+                        {
+                            sb << "__target_intrinsic(cuda, \"surfCubemap";
+                            if (isArray)
+                            {
+                                sb << "Layered";
+                            }
+                            sb << "read";
+                            sb << "<$T0>($0, ($1).x, ($1).y, ($1).z"; 
+                            if (isArray)
+                            {
+                                sb << ", int(($1).w)";
+                            }
+                            sb << ", SLANG_CUDA_BOUNDARY_MODE)\")\n";
+                        }
+                    }
+                    else if (access == SLANG_RESOURCE_ACCESS_READ)
+                    {
+                        // We can allow this on Texture1D
+                        if( baseShape == TextureFlavor::Shape::Shape1D && isArray == false)
+                        {
+                            sb << "__target_intrinsic(cuda, \"tex1Dfetch<$T0>($0, ($1).x)\")\n";
+                        }
+                    }
+                }
+
                 sb << "T Load(";
                 sb << "int" << loadCoordCount << " location";
                 if(isMultisample)
@@ -785,6 +846,7 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
                 }
                 sb << ");\n";
 
+                // GLSL
                 if (isMultisample)
                 {
                     sb << "__glsl_extension(GL_EXT_samplerless_texture_functions)";
@@ -804,6 +866,9 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
                     }
                     sb << ", $2)$z\")\n";
                 }
+
+
+
                 sb << "T Load(";
                 sb << "int" << loadCoordCount << " location";
                 if(isMultisample)
diff --git a/source/slang/core.meta.slang.h b/source/slang/core.meta.slang.h
@@ -798,6 +798,67 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
                     sb << ")$z\")\n";
 
                 }
+
+                // CUDA
+                if (isMultisample)
+                {
+                }
+                else
+                {
+                    if (access == SLANG_RESOURCE_ACCESS_READ_WRITE)
+                    {
+                        const int coordCount = kBaseTextureTypes[tt].coordCount;
+                        const int vecCount = coordCount + int(isArray);
+
+                        if( baseShape != TextureFlavor::Shape::ShapeCube )
+                        {
+                            sb << "__target_intrinsic(cuda, \"surf" << coordCount << "D";
+                            if (isArray)
+                            {
+                                sb << "Layered";
+                            }
+                            sb << "read";
+                            sb << "<$T0>($0";
+                            for (int i = 0; i < coordCount; ++i)
+                            {
+                                sb << ", ($1)";
+                                if (vecCount > 1)
+                                {
+                                    sb << '.' << char(i + 'x');
+                                }
+                            }
+                            if (isArray)
+                            {
+                                sb << ", int(($1)." << char(coordCount + 'x') << ")";
+                            }
+                            sb << ", SLANG_CUDA_BOUNDARY_MODE)\")\n";
+                        }
+                        else
+                        {
+                            sb << "__target_intrinsic(cuda, \"surfCubemap";
+                            if (isArray)
+                            {
+                                sb << "Layered";
+                            }
+                            sb << "read";
+                            sb << "<$T0>($0, ($1).x, ($1).y, ($1).z"; 
+                            if (isArray)
+                            {
+                                sb << ", int(($1).w)";
+                            }
+                            sb << ", SLANG_CUDA_BOUNDARY_MODE)\")\n";
+                        }
+                    }
+                    else if (access == SLANG_RESOURCE_ACCESS_READ)
+                    {
+                        // We can allow this on Texture1D
+                        if( baseShape == TextureFlavor::Shape::Shape1D && isArray == false)
+                        {
+                            sb << "__target_intrinsic(cuda, \"tex1Dfetch<$T0>($0, ($1).x)\")\n";
+                        }
+                    }
+                }
+
                 sb << "T Load(";
                 sb << "int" << loadCoordCount << " location";
                 if(isMultisample)
@@ -806,6 +867,7 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
                 }
                 sb << ");\n";
 
+                // GLSL
                 if (isMultisample)
                 {
                     sb << "__glsl_extension(GL_EXT_samplerless_texture_functions)";
@@ -825,6 +887,9 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
                     }
                     sb << ", $2)$z\")\n";
                 }
+
+
+
                 sb << "T Load(";
                 sb << "int" << loadCoordCount << " location";
                 if(isMultisample)
@@ -1359,7 +1424,7 @@ for (auto op : binaryOps)
         sb << "__intrinsic_op(" << int(op.opCode) << ") matrix<" << resultType << ",N,M> operator" << op.opName << "(" << leftQual << "matrix<" << leftType << ",N,M> left, " << rightType << " right);\n";
     }
 }
-SLANG_RAW("#line 1341 \"core.meta.slang\"")
+SLANG_RAW("#line 1406 \"core.meta.slang\"")
 SLANG_RAW("\n")
 SLANG_RAW("\n")
 SLANG_RAW("// Specialized function\n")
diff --git a/source/slang/hlsl.meta.slang b/source/slang/hlsl.meta.slang
@@ -1433,7 +1433,7 @@ __generic<T : __BuiltinType, let N : int, let M : int> uint4 WaveMatch(matrix<T,
 
 // TODO(JS): For CUDA the article claims mask has to be used carefully
 // https://devblogs.nvidia.com/using-cuda-warp-level-primitives/
-// With the Warp intrinsics there is though mask, and it's just the 'active lanes'. So __activemask()
+// With the Warp intrinsics there is no mask, and it's just the 'active lanes'. So __activemask()
 // seems to be appropriate.
 
 __target_intrinsic(cuda, "(__all_sync(__activemask(), $0) != 0)") 
diff --git a/source/slang/hlsl.meta.slang.h b/source/slang/hlsl.meta.slang.h
@@ -1509,7 +1509,7 @@ SLANG_RAW("__generic<T : __BuiltinType, let N : int, let M : int> uint4 WaveMatc
 SLANG_RAW("\n")
 SLANG_RAW("// TODO(JS): For CUDA the article claims mask has to be used carefully\n")
 SLANG_RAW("// https://devblogs.nvidia.com/using-cuda-warp-level-primitives/\n")
-SLANG_RAW("// With the Warp intrinsics there is though mask, and it's just the 'active lanes'. So __activemask()\n")
+SLANG_RAW("// With the Warp intrinsics there is no mask, and it's just the 'active lanes'. So __activemask()\n")
 SLANG_RAW("// seems to be appropriate.\n")
 SLANG_RAW("\n")
 SLANG_RAW("__target_intrinsic(cuda, \"(__all_sync(__activemask(), $0) != 0)\") \n")
diff --git a/tests/compute/rw-texture-simple.slang b/tests/compute/rw-texture-simple.slang
@@ -0,0 +1,27 @@
+//TEST(compute):COMPARE_COMPUTE_EX:-cpu -compute 
+// Doesn't work on DX11 currently - locks up on binding
+//DISABLE_TEST(compute):COMPARE_COMPUTE_EX:-slang -compute
+//TEST(compute):COMPARE_COMPUTE_EX:-slang -compute -dx12 
+//TEST(compute):COMPARE_COMPUTE_EX:-slang -compute -dx12 -profile cs_6_0 -use-dxil
+// TODO(JS): Doesn't work on vk currently, because createTextureView not implemented on vk renderer
+//DISABLE_TEST(compute, vulkan):COMPARE_COMPUTE_EX:-vk -compute
+//TEST(compute):COMPARE_COMPUTE_EX:-cuda -compute 
+
+//TEST_INPUT: RWTexture1D(format=R_Float32, size=4, content = one):name rwt1D
+RWTexture1D<float> rwt1D;
+
+//TEST_INPUT: ubuffer(data=[0 0 0 0], stride=4):out,name outputBuffer
+RWStructuredBuffer<float> outputBuffer;
+
+[numthreads(4, 4, 1)]
+void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
+{
+    int idx = dispatchThreadID.x;
+    float u = idx * (1.0f / 4);
+    
+    float val = 0.0f;
+ 
+    val += rwt1D.Load(idx);
+ 
+    outputBuffer[idx] = val;
+}
diff --git a/tests/compute/rw-texture-simple.slang.expected.txt b/tests/compute/rw-texture-simple.slang.expected.txt
@@ -0,0 +1,4 @@
+3F800000
+3F800000
+3F800000
+3F800000
diff --git a/tests/compute/texture-simple.slang b/tests/compute/texture-simple.slang
@@ -6,6 +6,10 @@
 //DISABLE_TEST(compute, vulkan):COMPARE_COMPUTE_EX:-vk -compute
 //TEST(compute):COMPARE_COMPUTE_EX:-cuda -compute 
 
+// Doesn't work on CUDA, not clear why yet
+//DISABLE_TEST_INPUT: Texture1D(format=R_Float32, size=4, content = one, mipMaps=1):name tLoad1D
+//Texture1D<float> tLoad1D;
+
 //TEST_INPUT: Texture1D(size=4, content = one):name t1D
 Texture1D<float> t1D;
 //TEST_INPUT: Texture2D(size=4, content = one):name t2D
@@ -35,6 +39,7 @@ void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
     float u = idx * (1.0f / 4);
     
     float val = 0.0f;
+   
     val += t1D.SampleLevel(samplerState, u, 0);
     val += t2D.SampleLevel(samplerState, float2(u, u), 0);
     val += t3D.SampleLevel(samplerState, float3(u, u, u), 0);
@@ -44,5 +49,7 @@ void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
     val += t2DArray.SampleLevel(samplerState, float3(u, u, 0), 0);
     val += tCubeArray.SampleLevel(samplerState, float4(u, u, u, 0), 0);
  
+    //val += tLoad1D.Load(int2(idx, 0));
+ 
     outputBuffer[idx] = val;
 }
diff --git a/tools/render-test/cpu-compute-util.cpp b/tools/render-test/cpu-compute-util.cpp
diff --git a/tools/render-test/cuda/cuda-compute-util.cpp b/tools/render-test/cuda/cuda-compute-util.cpp
diff --git a/tools/render-test/cuda/cuda-compute-util.h b/tools/render-test/cuda/cuda-compute-util.h
diff --git a/tools/render-test/shader-input-layout.cpp b/tools/render-test/shader-input-layout.cpp
diff --git a/tools/render-test/shader-input-layout.h b/tools/render-test/shader-input-layout.h