Commit 1f401d0

WIP on RWTexture types on CUDA/CPU (shader-slang#1234)
* CUDA support for array of resources.
* Add support for Texture2DArray on CPU
* Expand texture-simple.slang to test Texture2DArray
* Reorganise CUDAComputeUtil to split out createTextureResource.
* Add TextureCubeArray support for CPU/CUDA targets.
* Pulled out CUDAResource. Renamed derived classes to reflect that change.
* Creation of SurfObject type.
* Functions to return read/write access for simplifying future additions.
* WIP for RWTexture access on CPU/CUDA.
* CUsurfObject cannot have mips.
* Ability to set number of mips on test data. Preliminary support for CUsurfObj and RWTexture1D on CUDA. CUDA docs improvements.
* Fix typo.
1 parent f9d99fd commit 1f401d0

15 files changed, +397 -67 lines changed

docs/cuda-target.md (+28 -3)
@@ -16,13 +16,14 @@ These limitations apply to Slang transpiling to CUDA.
 
 * Only supports the 'texture object' style binding (The texture object API is only supported on devices of compute capability 3.0 or higher. )
 * Samplers are not separate objects in CUDA - they are combined into a single 'TextureObject'. So samplers are effectively ignored on CUDA targets.
-* Whilst there is tex1Dfetch there are no equivalents for higher dimensions - so such accesses are not currently supported
 * When using a TextureArray (layered texture in CUDA) - the index will be treated as an int, as this is all CUDA allows
 * Care must be used in using `WaveGetLaneIndex` wave intrinsic - it will only give the right results for appropriate launches
+* Surfaces are used for textures which are read/write. CUDA does NOT do format conversion with surfaces.
 
-The following are a work in progress or not implmented but are planned to be so in the future
+The following are a work in progress or not implemented but are planned to be so in the future
 
-* Resource types including surfaces
+* Some resource types remain unsupported, and not all methods on types are supported
+* Some support for Wave intrinsics
 
 # How it works
 
@@ -96,6 +97,30 @@ The UniformState and UniformEntryPointParams struct typically vary by shader. Un
     size_t sizeInBytes;
 ```
 
+## Texture
+
+Read only textures will be bound as the opaque CUDA type CUtexObject. This type is the combination of both a texture AND a sampler. This is somewhat different from HLSL, where there can be separate `SamplerState` variables, which allows a single texture binding to be accessed with different types of sampling.
+
+If code relies on this behavior it will be necessary to bind multiple CUtexObjects with different sampler settings, accessing the same texture data.
+
+Slang has some preliminary support for the TextureSampler type - a combined Texture and SamplerState. Writing Slang code that targets CUDA and other platforms using this mechanism will expose the semantics appropriately within the source.
+
+Load is only supported for Texture1D, and the mip map selection argument is ignored. This is because there is tex1Dfetch and no higher dimensional equivalents. CUDA also only allows such access if the backing array is linear memory - meaning the bound texture cannot have mip maps - thus making the mip map parameter superfluous anyway. RWTexture does allow Load on other texture types.
+
+## RWTexture
+
+RWTexture types are converted into CUsurfObject type.
+
+In CUDA it is not possible to do a format conversion on an access to a CUsurfObject, so it must be backed by the same data format as is used within the Slang source code.
+
+It is also worth noting that CUsurfObjects in CUDA are NOT allowed to have mip maps.
+
+By default surface access uses cudaBoundaryModeZero; this can be replaced using the macro SLANG_CUDA_BOUNDARY_MODE in the CUDA prelude.
+
+## Sampler
+
+Samplers are in effect ignored in CUDA output. Currently we do output a variable `SamplerState`, but this value is never accessed within the kernel and so can be ignored. More discussion on this behavior is in the `Texture` section.
+
 ## Unsized arrays
 
 Unsized arrays can be used, which are indicated by an array with no size as in `[]`. For example

prelude/slang-cpp-types.h (+15)
@@ -343,6 +343,21 @@ struct TextureCubeArray
     ITextureCubeArray* texture;
 };
 
+/* !!!!!!!!!!!!!!!!!!!!!!!!!!! RWTexture !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! */
+
+struct IRWTexture1D
+{
+    virtual void Load(int32_t loc, void* out) = 0;
+};
+
+template <typename T>
+struct RWTexture1D
+{
+    T Load(int32_t loc) const { T out; texture->Load(loc, &out); return out; }
+
+    IRWTexture1D* texture;
+};
+
 /* Varying input for Compute */
 
 /* Used when running a single thread */
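
To make the type-erased `Load` path in the hunk above concrete, here is a minimal host-side sketch. `IRWTexture1D` and `RWTexture1D` are copied from the diff; `FloatRWTexture1D` and `loadTexel` are hypothetical names invented for illustration, and returning zero for out-of-range reads is an assumption, not something the commit specifies for the CPU target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Interface and wrapper as added to prelude/slang-cpp-types.h in the hunk above.
struct IRWTexture1D
{
    virtual void Load(int32_t loc, void* out) = 0;
};

template <typename T>
struct RWTexture1D
{
    T Load(int32_t loc) const { T out; texture->Load(loc, &out); return out; }

    IRWTexture1D* texture;
};

// Hypothetical backing store (not part of the commit): a 1D texture of floats.
struct FloatRWTexture1D : IRWTexture1D
{
    explicit FloatRWTexture1D(std::vector<float> texels) : m_texels(std::move(texels)) {}

    void Load(int32_t loc, void* out) override
    {
        // Assumption: out-of-range reads yield zero.
        float value = 0.0f;
        if (loc >= 0 && static_cast<std::size_t>(loc) < m_texels.size())
        {
            value = m_texels[static_cast<std::size_t>(loc)];
        }
        // The interface is untyped, so the texel is copied out as raw bytes.
        std::memcpy(out, &value, sizeof(value));
    }

    std::vector<float> m_texels;
};

// Hypothetical helper: binds a backing store to the typed wrapper and reads one texel.
float loadTexel(const std::vector<float>& texels, int32_t loc)
{
    FloatRWTexture1D impl(texels);
    RWTexture1D<float> view;
    view.texture = &impl;
    return view.Load(loc);
}
```

The typed wrapper only knows the element type `T`; the implementation behind `IRWTexture1D` decides how texels are stored, which is why `Load` takes a `void*` destination.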

prelude/slang-cuda-prelude.h (+10)
@@ -38,6 +38,16 @@
 // Here we don't have the index zeroing behavior, as such bounds checks are generally not on GPU targets either.
 #ifndef SLANG_CUDA_FIXED_ARRAY_BOUND_CHECK
 #   define SLANG_CUDA_FIXED_ARRAY_BOUND_CHECK(index, count) SLANG_PRELUDE_ASSERT(index < count);
+#endif
+
+// This macro controls how out-of-range surface coordinates are handled;
+// It can equal
+// cudaBoundaryModeClamp, in which case out-of-range coordinates are clamped to the valid range
+// cudaBoundaryModeZero, in which case out-of-range reads return zero and out-of-range writes are ignored
+// cudaBoundaryModeTrap, in which case out-of-range accesses cause the kernel execution to fail.
+
+#ifndef SLANG_CUDA_BOUNDARY_MODE
+#   define SLANG_CUDA_BOUNDARY_MODE cudaBoundaryModeZero
 #endif
 
 template <typename T, size_t SIZE>
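
The three boundary modes described in the prelude comment can be modelled on the CPU as a rough sketch. Everything here except the macro name and the mode names is an assumption: the `BoundaryMode` enum is a stand-in for CUDA's own enum so the sketch builds without the CUDA toolkit, `emulatedSurfaceRead` is a hypothetical helper, and an exception stands in for a failed kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in for CUDA's boundary mode enum (assumption: only the names matter here).
enum BoundaryMode { cudaBoundaryModeZero, cudaBoundaryModeClamp, cudaBoundaryModeTrap };

// Same pattern as the prelude: a build may define the macro first, otherwise Zero is used.
#ifndef SLANG_CUDA_BOUNDARY_MODE
#   define SLANG_CUDA_BOUNDARY_MODE cudaBoundaryModeZero
#endif

// Hypothetical CPU model of a 1D surface read under each boundary mode.
inline float emulatedSurfaceRead(const std::vector<float>& texels, long index,
                                 BoundaryMode mode = SLANG_CUDA_BOUNDARY_MODE)
{
    if (texels.empty())
    {
        throw std::invalid_argument("surface has no texels");
    }
    const long count = static_cast<long>(texels.size());
    if (index >= 0 && index < count)
    {
        return texels[static_cast<std::size_t>(index)];
    }
    switch (mode)
    {
    case cudaBoundaryModeClamp:
        // Out-of-range coordinates are clamped to the valid range.
        return texels[static_cast<std::size_t>(index < 0 ? 0 : count - 1)];
    case cudaBoundaryModeTrap:
        // On the GPU the kernel would fail; an exception models that here.
        throw std::out_of_range("out-of-range surface access");
    case cudaBoundaryModeZero:
        break;
    }
    // Zero mode: out-of-range reads return zero.
    return 0.0f;
}
```

Because the prelude guards its definition with `#ifndef`, a definition of `SLANG_CUDA_BOUNDARY_MODE` that appears before the prelude is included takes precedence over the `cudaBoundaryModeZero` default.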

source/slang/core.meta.slang (+65)
@@ -777,6 +777,67 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
     sb << ")$z\")\n";
 
 }
+
+// CUDA
+if (isMultisample)
+{
+}
+else
+{
+    if (access == SLANG_RESOURCE_ACCESS_READ_WRITE)
+    {
+        const int coordCount = kBaseTextureTypes[tt].coordCount;
+        const int vecCount = coordCount + int(isArray);
+
+        if( baseShape != TextureFlavor::Shape::ShapeCube )
+        {
+            sb << "__target_intrinsic(cuda, \"surf" << coordCount << "D";
+            if (isArray)
+            {
+                sb << "Layered";
+            }
+            sb << "read";
+            sb << "<$T0>($0";
+            for (int i = 0; i < coordCount; ++i)
+            {
+                sb << ", ($1)";
+                if (vecCount > 1)
+                {
+                    sb << '.' << char(i + 'x');
+                }
+            }
+            if (isArray)
+            {
+                sb << ", int(($1)." << char(coordCount + 'x') << ")";
+            }
+            sb << ", SLANG_CUDA_BOUNDARY_MODE)\")\n";
+        }
+        else
+        {
+            sb << "__target_intrinsic(cuda, \"surfCubemap";
+            if (isArray)
+            {
+                sb << "Layered";
+            }
+            sb << "read";
+            sb << "<$T0>($0, ($1).x, ($1).y, ($1).z";
+            if (isArray)
+            {
+                sb << ", int(($1).w)";
+            }
+            sb << ", SLANG_CUDA_BOUNDARY_MODE)\")\n";
+        }
+    }
+    else if (access == SLANG_RESOURCE_ACCESS_READ)
+    {
+        // We can allow this on Texture1D
+        if( baseShape == TextureFlavor::Shape::Shape1D && isArray == false)
+        {
+            sb << "__target_intrinsic(cuda, \"tex1Dfetch<$T0>($0, ($1).x)\")\n";
+        }
+    }
+}
+
 sb << "T Load(";
 sb << "int" << loadCoordCount << " location";
 if(isMultisample)
@@ -785,6 +846,7 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
 }
 sb << ");\n";
 
+// GLSL
 if (isMultisample)
 {
     sb << "__glsl_extension(GL_EXT_samplerless_texture_functions)";
@@ -804,6 +866,9 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
     }
     sb << ", $2)$z\")\n";
 }
+
+
+
 sb << "T Load(";
 sb << "int" << loadCoordCount << " location";
 if(isMultisample)
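
To see which strings the read/write branch of the CUDA hunk above produces, the string-building logic can be lifted into a standalone function. This is a sketch under stated assumptions: `cudaSurfaceReadIntrinsic` is a hypothetical name, its parameters stand in for `kBaseTextureTypes[tt].coordCount`, `isArray`, and the cube-shape check, and it returns only the expression inside the `__target_intrinsic(cuda, "...")` wrapper.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical standalone re-creation of the SLANG_RESOURCE_ACCESS_READ_WRITE branch.
std::string cudaSurfaceReadIntrinsic(int coordCount, bool isArray, bool isCube)
{
    std::ostringstream sb;
    if (!isCube)
    {
        // The location vector carries one extra component when a layer index is present.
        const int vecCount = coordCount + int(isArray);

        sb << "surf" << coordCount << "D";
        if (isArray)
        {
            sb << "Layered";
        }
        sb << "read<$T0>($0";
        for (int i = 0; i < coordCount; ++i)
        {
            sb << ", ($1)";
            // A scalar location has no swizzle; vectors are split into .x/.y/.z.
            if (vecCount > 1)
            {
                sb << '.' << char(i + 'x');
            }
        }
        if (isArray)
        {
            // The layer index is the component after the coordinates, cast to int.
            sb << ", int(($1)." << char(coordCount + 'x') << ")";
        }
    }
    else
    {
        sb << "surfCubemap";
        if (isArray)
        {
            sb << "Layered";
        }
        sb << "read<$T0>($0, ($1).x, ($1).y, ($1).z";
        if (isArray)
        {
            sb << ", int(($1).w)";
        }
    }
    sb << ", SLANG_CUDA_BOUNDARY_MODE)";
    return sb.str();
}
```

For example, `cudaSurfaceReadIntrinsic(2, true, false)` yields `surf2DLayeredread<$T0>($0, ($1).x, ($1).y, int(($1).z), SLANG_CUDA_BOUNDARY_MODE)`, which matches the documented behavior that the array index is treated as an int. CUDA's surface read functions take the x coordinate in bytes, and the emitted strings pass the coordinate through unscaled, so that is worth checking against the CUDA documentation before relying on non-zero indices.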

source/slang/core.meta.slang.h (+66 -1)
@@ -798,6 +798,67 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
     sb << ")$z\")\n";
 
 }
+
+// CUDA
+if (isMultisample)
+{
+}
+else
+{
+    if (access == SLANG_RESOURCE_ACCESS_READ_WRITE)
+    {
+        const int coordCount = kBaseTextureTypes[tt].coordCount;
+        const int vecCount = coordCount + int(isArray);
+
+        if( baseShape != TextureFlavor::Shape::ShapeCube )
+        {
+            sb << "__target_intrinsic(cuda, \"surf" << coordCount << "D";
+            if (isArray)
+            {
+                sb << "Layered";
+            }
+            sb << "read";
+            sb << "<$T0>($0";
+            for (int i = 0; i < coordCount; ++i)
+            {
+                sb << ", ($1)";
+                if (vecCount > 1)
+                {
+                    sb << '.' << char(i + 'x');
+                }
+            }
+            if (isArray)
+            {
+                sb << ", int(($1)." << char(coordCount + 'x') << ")";
+            }
+            sb << ", SLANG_CUDA_BOUNDARY_MODE)\")\n";
+        }
+        else
+        {
+            sb << "__target_intrinsic(cuda, \"surfCubemap";
+            if (isArray)
+            {
+                sb << "Layered";
+            }
+            sb << "read";
+            sb << "<$T0>($0, ($1).x, ($1).y, ($1).z";
+            if (isArray)
+            {
+                sb << ", int(($1).w)";
+            }
+            sb << ", SLANG_CUDA_BOUNDARY_MODE)\")\n";
+        }
+    }
+    else if (access == SLANG_RESOURCE_ACCESS_READ)
+    {
+        // We can allow this on Texture1D
+        if( baseShape == TextureFlavor::Shape::Shape1D && isArray == false)
+        {
+            sb << "__target_intrinsic(cuda, \"tex1Dfetch<$T0>($0, ($1).x)\")\n";
+        }
+    }
+}
+
 sb << "T Load(";
 sb << "int" << loadCoordCount << " location";
 if(isMultisample)
@@ -806,6 +867,7 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
 }
 sb << ");\n";
 
+// GLSL
 if (isMultisample)
 {
     sb << "__glsl_extension(GL_EXT_samplerless_texture_functions)";
@@ -825,6 +887,9 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
     }
     sb << ", $2)$z\")\n";
 }
+
+
+
 sb << "T Load(";
 sb << "int" << loadCoordCount << " location";
 if(isMultisample)
@@ -1359,7 +1424,7 @@ for (auto op : binaryOps)
         sb << "__intrinsic_op(" << int(op.opCode) << ") matrix<" << resultType << ",N,M> operator" << op.opName << "(" << leftQual << "matrix<" << leftType << ",N,M> left, " << rightType << " right);\n";
     }
 }
-SLANG_RAW("#line 1341 \"core.meta.slang\"")
+SLANG_RAW("#line 1406 \"core.meta.slang\"")
 SLANG_RAW("\n")
 SLANG_RAW("\n")
 SLANG_RAW("// Specialized function\n")

source/slang/hlsl.meta.slang (+1 -1)
@@ -1433,7 +1433,7 @@ __generic<T : __BuiltinType, let N : int, let M : int> uint4 WaveMatch(matrix<T,
 
 // TODO(JS): For CUDA the article claims mask has to be used carefully
 // https://devblogs.nvidia.com/using-cuda-warp-level-primitives/
-// With the Warp intrinsics there is though mask, and it's just the 'active lanes'. So __activemask()
+// With the Warp intrinsics there is no mask, and it's just the 'active lanes'. So __activemask()
 // seems to be appropriate.
 
 __target_intrinsic(cuda, "(__all_sync(__activemask(), $0) != 0)")

source/slang/hlsl.meta.slang.h (+1 -1)
@@ -1509,7 +1509,7 @@ SLANG_RAW("__generic<T : __BuiltinType, let N : int, let M : int> uint4 WaveMatc
 SLANG_RAW("\n")
 SLANG_RAW("// TODO(JS): For CUDA the article claims mask has to be used carefully\n")
 SLANG_RAW("// https://devblogs.nvidia.com/using-cuda-warp-level-primitives/\n")
-SLANG_RAW("// With the Warp intrinsics there is though mask, and it's just the 'active lanes'. So __activemask()\n")
+SLANG_RAW("// With the Warp intrinsics there is no mask, and it's just the 'active lanes'. So __activemask()\n")
 SLANG_RAW("// seems to be appropriate.\n")
 SLANG_RAW("\n")
 SLANG_RAW("__target_intrinsic(cuda, \"(__all_sync(__activemask(), $0) != 0)\") \n")

tests/compute/rw-texture-simple.slang (+27)
@@ -0,0 +1,27 @@
+//TEST(compute):COMPARE_COMPUTE_EX:-cpu -compute
+// Doesn't work on DX11 currently - locks up on binding
+//DISABLE_TEST(compute):COMPARE_COMPUTE_EX:-slang -compute
+//TEST(compute):COMPARE_COMPUTE_EX:-slang -compute -dx12
+//TEST(compute):COMPARE_COMPUTE_EX:-slang -compute -dx12 -profile cs_6_0 -use-dxil
+// TODO(JS): Doesn't work on vk currently, because createTextureView not implemented on vk renderer
+//DISABLE_TEST(compute, vulkan):COMPARE_COMPUTE_EX:-vk -compute
+//TEST(compute):COMPARE_COMPUTE_EX:-cuda -compute
+
+//TEST_INPUT: RWTexture1D(format=R_Float32, size=4, content = one):name rwt1D
+RWTexture1D<float> rwt1D;
+
+//TEST_INPUT: ubuffer(data=[0 0 0 0], stride=4):out,name outputBuffer
+RWStructuredBuffer<float> outputBuffer;
+
+[numthreads(4, 4, 1)]
+void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
+{
+    int idx = dispatchThreadID.x;
+    float u = idx * (1.0f / 4);
+
+    float val = 0.0f;
+
+    val += rwt1D.Load(idx);
+
+    outputBuffer[idx] = val;
+}
@@ -0,0 +1,4 @@
+3F800000
+3F800000
+3F800000
+3F800000
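
The four `3F800000` values are the IEEE-754 single-precision bit pattern of 1.0f, which is what each thread would write if `Load` returns 1.0 from a texture filled with `content = one`. A small sketch of that conversion follows; `floatBitsAsHex` is a hypothetical helper, and the assumption that the test harness prints each 4-byte element as raw hexadecimal is inferred from the expected output rather than stated in the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Formats the bit pattern of a 32-bit float as eight upper-case hex digits.
std::string floatBitsAsHex(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch expects a 32-bit float");

    // memcpy is the portable way to reinterpret the bytes of a float.
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));

    char text[9];
    std::snprintf(text, sizeof(text), "%08X", static_cast<unsigned int>(bits));
    return std::string(text);
}
```

With this helper, `floatBitsAsHex(0.0f + 1.0f)` gives `3F800000`, mirroring `val = 0.0f; val += rwt1D.Load(idx);` in the test shader.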

tests/compute/texture-simple.slang (+7)
@@ -6,6 +6,10 @@
 //DISABLE_TEST(compute, vulkan):COMPARE_COMPUTE_EX:-vk -compute
 //TEST(compute):COMPARE_COMPUTE_EX:-cuda -compute
 
+// Doesn't work on CUDA, not clear why yet
+//DISABLE_TEST_INPUT: Texture1D(format=R_Float32, size=4, content = one, mipMaps=1):name tLoad1D
+//Texture1D<float> tLoad1D;
+
 //TEST_INPUT: Texture1D(size=4, content = one):name t1D
 Texture1D<float> t1D;
 //TEST_INPUT: Texture2D(size=4, content = one):name t2D
@@ -35,6 +39,7 @@ void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
     float u = idx * (1.0f / 4);
 
     float val = 0.0f;
+
     val += t1D.SampleLevel(samplerState, u, 0);
     val += t2D.SampleLevel(samplerState, float2(u, u), 0);
     val += t3D.SampleLevel(samplerState, float3(u, u, u), 0);
@@ -44,5 +49,7 @@ void computeMain(uint3 dispatchThreadID : SV_DispatchThreadID)
     val += t2DArray.SampleLevel(samplerState, float3(u, u, 0), 0);
     val += tCubeArray.SampleLevel(samplerState, float4(u, u, u, 0), 0);
 
+    //val += tLoad1D.Load(int2(idx, 0));
+
     outputBuffer[idx] = val;
 }
