Skip to content

Commit d5e8044

Browse files
authored
Read half->float RWTexture conversion (shader-slang#1842)
* #include an absolute path didn't work - because paths were taken to always be relative. * Fix for writing to RWTexture with half types on CUDA. * CUDA half functionality doc updates. * First pass support for sust.p RWTexture format conversion on write. * Tidy up implementation of $C. Made clamping mode #define able. * A simple test for RWTexture CUDA format conversion. * Add support for float2 and float4. * WIP conversion testing. * Use $E to fix byte addressing in X in CUDA. * Do not scale when accessing via _convert versions of surface functions. * Revert to previous test. * Test with half/float convert write/read. * More broad half->float read conversion testing. * Improve documentation around half and RWTexture conversion.
1 parent bfe7561 commit d5e8044

8 files changed

+276
-34
lines changed

docs/cuda-target.md

+51-6
Original file line numberDiff line numberDiff line change
@@ -20,11 +20,11 @@ These limitations apply to Slang transpiling to CUDA.
2020
* Samplers are not separate objects in CUDA - they are combined into a single 'TextureObject'. So samplers are effectively ignored on CUDA targets.
2121
* When using a TextureArray.Sample (layered texture in CUDA) - the index will be treated as an int, as this is all CUDA allows
2222
* Care must be used in using `WaveGetLaneIndex` wave intrinsic - it will only give the right results for appropriate launches
23-
* CUDA 'surfaces' are used for textures which are read/write. CUDA does NOT do format conversion with surfaces.
23+
* CUDA 'surfaces' are used for textures which are read/write (aka RWTexture).
2424

2525
The following are a work in progress or not implemented but are planned to be so in the future
2626

27-
* Some resource types remain unsupported, and not all methods on types are supported
27+
* Some resource types remain unsupported, and not all methods on all types are supported
2828

2929
# How it works
3030

@@ -122,8 +122,6 @@ The UniformState and UniformEntryPointParams struct typically vary by shader. Un
122122
size_t sizeInBytes;
123123
```
124124

125-
126-
127125
## Texture
128126

129127
Read only textures will be bound as the opaque CUDA type CUtexObject. This type is the combination of both a texture AND a sampler. This is somewhat different from HLSL, where there can be separate `SamplerState` variables. This allows access of a single texture binding with different types of sampling.
@@ -138,11 +136,58 @@ Load is only supported for Texture1D, and the mip map selection argument is igno
138136

139137
RWTexture types are converted into CUsurfObject type.
140138

141-
In CUDA it is not possible to do a format conversion on an access to a CUsurfObject, so it must be backed by the same data format as is used within the Slang source code.
139+
In regular CUDA it is not possible to do a format conversion on an access to a CUsurfObject. Slang does add support for hardware write conversions where they are available. To enable the feature it is necessary to attribute your RWTexture with `format`. For example
140+
141+
```
142+
[format("rg16f")]
143+
RWTexture2D<float2> rwt2D_2;
144+
```
145+
146+
The format names used are the same as for (GLSL layout format types)[https://www.khronos.org/opengl/wiki/Layout_Qualifier_(GLSL)]. If no format is specified Slang will *assume* that the format is the same as the type specified.
147+
148+
Note that the format attribution is on variables/paramters/fields and not part of the type system. This means that if you have a scenario like...
149+
150+
```
151+
[format(rg16f)]
152+
RWTexture2d<float2> g_texture;
153+
154+
float2 getValue(RWTexture2D<float2> t)
155+
{
156+
return t[int2(0, 0];
157+
}
158+
159+
void doThing()
160+
{
161+
float2 v = getValue(g_texture);
162+
}
163+
```
164+
165+
Even `getValue` will receive t *without* the format attribute, and so will access it, presumably erroneously. A work around for this specific scenario would be to attribute the parameter
166+
167+
```
168+
float2 getValue([format("rg16f")] RWTexture2D<float2> t)
169+
{
170+
return t[int2(0, 0];
171+
}
172+
```
173+
174+
This will only work correctly if `getValue` is called with a `t` that has that format attribute. As it stands no checking is performed on this matching so no error or warning will be produced if there is a mismatch.
175+
176+
There is limited software support for doing a conversion on reading. Currently this only supports only 1D, 2D, 3D RWTexture, backed with half1, half2 or half4. For this path to work NVRTC must have the `cuda_fp16.h` and associated files available. Please check the section on `Half Support`.
177+
178+
If hardware read conversions are desired, this can be achieved by having a Texture<T> that uses the surface of a RWTexture<T>. Using the Texture<T> not only allows hardware conversion but also filtering.
142179

143180
It is also worth noting that CUsurfObjects in CUDA are NOT allowed to have mip maps.
144181

145-
By default surface access uses cudaBoundaryModeZero, this can be replaced using the macro SLANG_CUDA_BOUNDARY_MODE in the CUDA prelude.
182+
By default surface access uses cudaBoundaryModeZero, this can be replaced using the macro SLANG_CUDA_BOUNDARY_MODE in the CUDA prelude. For HW format conversions the macro SLANG_PTX_BOUNDARY_MODE. These boundary settings are in effect global for the whole of the kernel.
183+
184+
`SLANG_CUDA_BOUNDARY_MODE` can be one of
185+
186+
* cudaBoundaryModeZero causes an execution trap on out-of-bounds addresses
187+
* cudaBoundaryModeClamp stores data at the nearest surface location (sized appropriately)
188+
* cudaBoundaryModeTrap drops stores to out-of-bounds addresses
189+
190+
`SLANG_PTX_BOUNDARY_MODE` can be one of `trap`, `clamp` or `zero`. In general it is recommended to have both set to the same type of value, for example `cudaBoundaryModeZero` and `zero`.
146191

147192
## Sampler
148193

docs/target-compatibility.md

-3
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,6 @@
11
Slang Target Compatibility
22
==========================
33

4-
54
Shader Model (SM) numbers are D3D Shader Model versions, unless explicitly stated otherwise.
65
OpenGL compatibility is not listed here, because OpenGL isn't an officially supported target.
76

@@ -203,8 +202,6 @@ uint64_t RWByteAddressBuffer::InterlockedMinU64(uint byteAddress, uint64_t value
203202
uint64_t RWByteAddressBuffer::InterlockedAndU64(uint byteAddress, uint64_t value);
204203
uint64_t RWByteAddressBuffer::InterlockedOrU64(uint byteAddress, uint64_t value);
205204
uint64_t RWByteAddressBuffer::InterlockedXorU64(uint byteAddress, uint64_t value);
206-
207-
208205
```
209206

210207
On HLSL based targets this functionality is achieved using [NVAPI](https://developer.nvidia.com/nvapi). Support for NVAPI is described

prelude/slang-cuda-prelude.h

+90
Original file line numberDiff line numberDiff line change
@@ -381,6 +381,41 @@ SLANG_SURFACE_WRITE(surf2DLayeredwrite, (int x, int y, int layer), (x, y, layer)
381381
SLANG_SURFACE_WRITE(surfCubemapwrite, (int x, int y, int face), (x, y, face))
382382
SLANG_SURFACE_WRITE(surfCubemapLayeredwrite, (int x, int y, int layerFace), (x, y, layerFace))
383383

384+
// ! Hack to test out reading !!!
385+
// Only works converting *from* half
386+
387+
//template <typename T>
388+
//SLANG_FORCE_INLINE SLANG_CUDA_CALL T surf2Dread_convert(cudaSurfaceObject_t surfObj, int x, int y, cudaSurfaceBoundaryMode boundaryMode);
389+
390+
#define SLANG_SURFACE_READ_HALF_CONVERT(FUNC_NAME, TYPE_ARGS, ARGS) \
391+
\
392+
template <typename T> \
393+
SLANG_FORCE_INLINE SLANG_CUDA_CALL T FUNC_NAME##_convert(cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode); \
394+
\
395+
template <> \
396+
SLANG_FORCE_INLINE SLANG_CUDA_CALL float FUNC_NAME##_convert<float>(cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode) \
397+
{ \
398+
return __ushort_as_half(FUNC_NAME<uint16_t>(surfObj, SLANG_DROP_PARENS ARGS, boundaryMode)); \
399+
} \
400+
\
401+
template <> \
402+
SLANG_FORCE_INLINE SLANG_CUDA_CALL float2 FUNC_NAME##_convert<float2>(cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode) \
403+
{ \
404+
const __half2 v = __ushort_as_half(FUNC_NAME<ushort2>(surfObj, SLANG_DROP_PARENS ARGS, boundaryMode)); \
405+
return float2{v.x, v.y}; \
406+
} \
407+
\
408+
template <> \
409+
SLANG_FORCE_INLINE SLANG_CUDA_CALL float4 FUNC_NAME##_convert<float4>(cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode) \
410+
{ \
411+
const __half4 v = __ushort_as_half(FUNC_NAME<ushort4>(surfObj, SLANG_DROP_PARENS ARGS, boundaryMode)); \
412+
return float4{v.xy.x, v.xy.y, v.zw.x, v.zw.y}; \
413+
}
414+
415+
SLANG_SURFACE_READ_HALF_CONVERT(surf1Dread, (int x), (x))
416+
SLANG_SURFACE_READ_HALF_CONVERT(surf2Dread, (int x, int y), (x, y))
417+
SLANG_SURFACE_READ_HALF_CONVERT(surf3Dread, (int x, int y, int z), (x, y, z))
418+
384419
#endif
385420

386421
// Support for doing format conversion when writing to a surface/RWTexture
@@ -392,10 +427,14 @@ template <typename T>
392427
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf1Dwrite_convert(T, cudaSurfaceObject_t surfObj, int x, cudaSurfaceBoundaryMode boundaryMode);
393428
template <typename T>
394429
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf2Dwrite_convert(T, cudaSurfaceObject_t surfObj, int x, int y, cudaSurfaceBoundaryMode boundaryMode);
430+
template <typename T>
431+
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf3Dwrite_convert(T, cudaSurfaceObject_t surfObj, int x, int y, int z, cudaSurfaceBoundaryMode boundaryMode);
395432

396433
// https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html
397434
// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#surface-instructions-sust
398435

436+
// Float
437+
399438
template <>
400439
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf1Dwrite_convert<float>(float v, cudaSurfaceObject_t surfObj, int x, cudaSurfaceBoundaryMode boundaryMode)
401440
{
@@ -408,6 +447,57 @@ SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf2Dwrite_convert<float>(float v, cuda
408447
asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2}], {%3};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"f"(v));
409448
}
410449

450+
template <>
451+
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf3Dwrite_convert<float>(float v, cudaSurfaceObject_t surfObj, int x, int y, int z, cudaSurfaceBoundaryMode boundaryMode)
452+
{
453+
asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2,%3}], {%4};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"r"(z),"f"(v));
454+
}
455+
456+
// Float2
457+
458+
template <>
459+
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf1Dwrite_convert<float2>(float2 v, cudaSurfaceObject_t surfObj, int x, cudaSurfaceBoundaryMode boundaryMode)
460+
{
461+
const float vx = v.x, vy = v.y;
462+
asm volatile ( "{sust.p.1d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1}], {%2,%3};}\n\t" :: "l"(surfObj),"r"(x),"f"(vx),"f"(vy));
463+
}
464+
465+
template <>
466+
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf2Dwrite_convert<float2>(float2 v, cudaSurfaceObject_t surfObj, int x, int y, cudaSurfaceBoundaryMode boundaryMode)
467+
{
468+
const float vx = v.x, vy = v.y;
469+
asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2}], {%3,%4};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"f"(vx),"f"(vy));
470+
}
471+
472+
template <>
473+
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf3Dwrite_convert<float2>(float2 v, cudaSurfaceObject_t surfObj, int x, int y, int z, cudaSurfaceBoundaryMode boundaryMode)
474+
{
475+
const float vx = v.x, vy = v.y;
476+
asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2,%3}], {%4,%5};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"r"(z),"f"(vx),"f"(vy));
477+
}
478+
479+
// Float4
480+
template <>
481+
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf1Dwrite_convert<float4>(float4 v, cudaSurfaceObject_t surfObj, int x, cudaSurfaceBoundaryMode boundaryMode)
482+
{
483+
const float vx = v.x, vy = v.y, vz = v.z, vw = v.w;
484+
asm volatile ( "{sust.p.1d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1}], {%2,%3,%4,%5};}\n\t" :: "l"(surfObj),"r"(x),"f"(vx),"f"(vy),"f"(vz),"f"(vw));
485+
}
486+
487+
template <>
488+
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf2Dwrite_convert<float4>(float4 v, cudaSurfaceObject_t surfObj, int x, int y, cudaSurfaceBoundaryMode boundaryMode)
489+
{
490+
const float vx = v.x, vy = v.y, vz = v.z, vw = v.w;
491+
asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2}], {%3,%4,%5,%6};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"f"(vx),"f"(vy),"f"(vz),"f"(vw));
492+
}
493+
494+
template <>
495+
SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf3Dwrite_convert<float4>(float4 v, cudaSurfaceObject_t surfObj, int x, int y, int z, cudaSurfaceBoundaryMode boundaryMode)
496+
{
497+
const float vx = v.x, vy = v.y, vz = v.z, vw = v.w;
498+
asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2,%3}], {%4,%5,%6,%7};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"r"(z),"f"(vx),"f"(vy),"f"(vz),"f"(vw));
499+
}
500+
411501
// ----------------------------- F32 -----------------------------------------
412502

413503
// Unary

source/slang/core.meta.slang

+1-1
Original file line numberDiff line numberDiff line change
@@ -1083,7 +1083,7 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
10831083
}
10841084

10851085
sb << (isArray ? "Layered" : "");
1086-
sb << "read<$T0>($0";
1086+
sb << "read$C<$T0>($0";
10871087

10881088
for (int i = 0; i < vecCount; ++i)
10891089
{

source/slang/slang-intrinsic-expand.cpp

+63-23
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
// slang-intrinsic-expand.cpp
22
#include "slang-intrinsic-expand.h"
33

4+
#include "slang-emit-cuda.h"
5+
46
namespace Slang {
57

68
void IntrinsicExpandContext::emit(IRCall* inst, IRUse* args, Int argCount, const UnownedStringSlice& intrinsicText)
@@ -101,13 +103,13 @@ static BaseType _getBaseTypeFromScalarType(SlangScalarType type)
101103
// The VK back-end gets away with this kind of coincidentally, since the "legalization" we have to do for resources means that there wouldn't be a single f() function any more.
102104
// But for CUDA and C++ that's not the case or generally desirable.
103105

104-
static IRFormatDecoration* _findImageFormatDecoration(IRInst* inst)
106+
static IRFormatDecoration* _findImageFormatDecoration(IRInst* resourceInst)
105107
{
106108
// JS(TODO):
107109
// There could perhaps be other situations, that need to be covered
108110

109111
// If this is a load, we need to get the decoration from the field key
110-
if (IRLoad* load = as<IRLoad>(inst))
112+
if (IRLoad* load = as<IRLoad>(resourceInst))
111113
{
112114
if (IRFieldAddress* fieldAddress = as<IRFieldAddress>(load->getOperand(0)))
113115
{
@@ -116,7 +118,7 @@ static IRFormatDecoration* _findImageFormatDecoration(IRInst* inst)
116118
}
117119
}
118120
// Otherwise just try on the instruction
119-
return inst->findDecoration<IRFormatDecoration>();
121+
return resourceInst->findDecoration<IRFormatDecoration>();
120122
}
121123

122124
// Returns true if dataType and imageFormat are compatible - that they have the same representation,
@@ -149,36 +151,26 @@ static bool _isImageFormatCompatible(ImageFormat imageFormat, IRType* dataType)
149151
return formatBaseType == baseType;
150152
}
151153

152-
static bool _isConvertRequired(ImageFormat imageFormat, IRInst* resourceVar)
154+
static bool _isConvertRequired(ImageFormat imageFormat, IRInst* callee)
153155
{
154-
auto textureType = as<IRTextureTypeBase>(resourceVar->getDataType());
156+
auto textureType = as<IRTextureTypeBase>(callee->getDataType());
155157
IRType* elementType = textureType ? textureType->getElementType() : nullptr;
156158
return elementType && !_isImageFormatCompatible(imageFormat, elementType);
157159
}
158160

159-
static size_t _calcBackingElementSizeInBytes(IRInst* resourceVar)
161+
static size_t _calcBackingElementSizeInBytes(IRInst* resourceInst)
160162
{
161163
// First see if there is a format associated with the resource
162-
if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(resourceVar))
164+
if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(resourceInst))
163165
{
164-
const ImageFormat imageFormat = formatDecoration->getFormat();
165-
166-
if (_isConvertRequired(imageFormat, resourceVar))
167-
{
168-
// If the access is a converting access then the x coordinate is *NOT* scaled
169-
// This is a CUDA specific issue(!).
170-
return 1;
171-
}
172-
173-
const auto& imageFormatInfo = getImageFormatInfo(imageFormat);
174-
return imageFormatInfo.sizeInBytes;
166+
return getImageFormatInfo(formatDecoration->getFormat()).sizeInBytes;
175167
}
176168
else
177169
{
178170
// If not we *assume* the backing format is the same as the element type used for access.
179171
/// Ie in RWTexture<T>, this would return sizeof(T)
180172

181-
auto textureType = as<IRTextureTypeBase>(resourceVar->getDataType());
173+
auto textureType = as<IRTextureTypeBase>(resourceInst->getDataType());
182174
IRType* elementType = textureType ? textureType->getElementType() : nullptr;
183175

184176
if (elementType)
@@ -206,6 +198,18 @@ static size_t _calcBackingElementSizeInBytes(IRInst* resourceVar)
206198
return 4;
207199
}
208200

201+
static bool _isResourceRead(IRCall* call)
202+
{
203+
IRType* returnType = call->getDataType();
204+
return returnType && (as<IRVoidType>(returnType) == nullptr);
205+
}
206+
207+
static bool _isResourceWrite(IRCall* call)
208+
{
209+
IRType* returnType = call->getDataType();
210+
return returnType && (as<IRVoidType>(returnType) != nullptr);
211+
}
212+
209213
const char* IntrinsicExpandContext::_emitSpecial(const char* cursor)
210214
{
211215
const char*const end = m_text.end();
@@ -323,13 +327,35 @@ const char* IntrinsicExpandContext::_emitSpecial(const char* cursor)
323327
// writes that will do a format conversion.
324328
if (m_emitter->getTarget() == CodeGenTarget::CUDASource)
325329
{
326-
IRInst* arg0 = m_callInst->getArg(0);
330+
IRInst* resourceInst = m_callInst->getArg(0);
327331

328-
if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(arg0))
332+
if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(resourceInst))
329333
{
330334
const ImageFormat imageFormat = formatDecoration->getFormat();
331-
if (_isConvertRequired(imageFormat, arg0))
335+
if (_isConvertRequired(imageFormat, resourceInst))
332336
{
337+
// If the function returns something it's a reader so we may need to convert
338+
// and in doing so require half
339+
if (_isResourceRead(m_callInst))
340+
{
341+
// If the source format if half derived, then we need to enable half
342+
switch (imageFormat)
343+
{
344+
case ImageFormat::r16f:
345+
case ImageFormat::rg16f:
346+
case ImageFormat::rgba16f:
347+
{
348+
CUDAExtensionTracker* extensionTracker = as<CUDAExtensionTracker>(m_emitter->getExtensionTracker());
349+
if (extensionTracker)
350+
{
351+
extensionTracker->requireBaseType(BaseType::Half);
352+
}
353+
break;
354+
}
355+
default: break;
356+
}
357+
}
358+
333359
// Append _convert on the name to signify we need to use a code path, that will automatically
334360
// do the format conversion.
335361
m_writer->emit("_convert");
@@ -344,7 +370,21 @@ const char* IntrinsicExpandContext::_emitSpecial(const char* cursor)
344370
/// Sometimes accesses need to be scaled. For example in CUDA the x coordinate for surface
345371
/// access is byte addressed.
346372
/// $E will return the byte size of the *backing element*.
347-
size_t elemSizeInBytes = _calcBackingElementSizeInBytes(m_callInst->getArg(0));
373+
374+
IRInst* resourceInst = m_callInst->getArg(0);
375+
size_t elemSizeInBytes = _calcBackingElementSizeInBytes(resourceInst);
376+
377+
// If we have a format converstion and its a *write* we don't need to scale
378+
if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(resourceInst))
379+
{
380+
const ImageFormat imageFormat = formatDecoration->getFormat();
381+
if (_isConvertRequired(imageFormat, resourceInst) && _isResourceWrite(m_callInst))
382+
{
383+
// If there is a conversion *and* it's a write we don't need to scale.
384+
elemSizeInBytes = 1;
385+
}
386+
}
387+
348388
SLANG_ASSERT(elemSizeInBytes > 0);
349389
m_writer->emitUInt64(UInt64(elemSizeInBytes));
350390
break;

0 commit comments

Comments
 (0)