skallweitNV
diff --git a/‎docs/cuda-target.md
+51-6 b/‎docs/cuda-target.md
+51-6
diff --git a/‎docs/target-compatibility.md
-3 b/‎docs/target-compatibility.md
-3
diff --git a/‎prelude/slang-cuda-prelude.h
+90 b/‎prelude/slang-cuda-prelude.h
+90
diff --git a/‎source/slang/core.meta.slang
+1-1 b/‎source/slang/core.meta.slang
+1-1
diff --git a/‎source/slang/slang-intrinsic-expand.cpp
+63-23 b/‎source/slang/slang-intrinsic-expand.cpp
+63-23
@@ -20,11 +20,11 @@ These limitations apply to Slang transpiling to CUDA.
 * Samplers are not separate objects in CUDA - they are combined into a single 'TextureObject'. So samplers are effectively ignored on CUDA targets. 
 * When using a TextureArray.Sample (layered texture in CUDA) - the index will be treated as an int, as this is all CUDA allows
 * Care must be used in using `WaveGetLaneIndex` wave intrinsic - it will only give the right results for appropriate launches
-* CUDA 'surfaces' are used for textures which are read/write. CUDA does NOT do format conversion with surfaces.
+* CUDA 'surfaces' are used for textures which are read/write (aka RWTexture). 
 
 The following are a work in progress or not implemented but are planned to be so in the future
 
-* Some resource types remain unsupported, and not all methods on types are supported
+* Some resource types remain unsupported, and not all methods on all types are supported
 
 # How it works
 
@@ -122,8 +122,6 @@ The UniformState and UniformEntryPointParams struct typically vary by shader. Un
     size_t sizeInBytes;
 ```  
 
-
-
 ## Texture
 
 Read only textures will be bound as the opaque CUDA type CUtexObject. This type is the combination of both a texture AND a sampler. This is somewhat different from HLSL, where there can be separate `SamplerState` variables. This allows access of a single texture binding with different types of sampling. 
@@ -138,11 +136,58 @@ Load is only supported for Texture1D, and the mip map selection argument is igno
 
 RWTexture types are converted into CUsurfObject type. 
 
-In CUDA it is not possible to do a format conversion on an access to a CUsurfObject, so it must be backed by the same data format as is used within the Slang source code. 
+In regular CUDA it is not possible to do a format conversion on an access to a CUsurfObject. Slang does add support for hardware write conversions where they are available. To enable the feature it is necessary to attribute your RWTexture with `format`. For example
+
+```
+[format("rg16f")]
+RWTexture2D<float2> rwt2D_2;
+```
+
+The format names used are the same as for (GLSL layout format types)[https://www.khronos.org/opengl/wiki/Layout_Qualifier_(GLSL)]. If no format is specified Slang will *assume* that the format is the same as the type specified. 
+
+Note that the format attribution is on variables/paramters/fields and not part of the type system. This means that if you have a scenario like...
+
+```
+[format(rg16f)]
+RWTexture2d<float2> g_texture;
+
+float2 getValue(RWTexture2D<float2> t)
+{
+    return t[int2(0, 0];
+}
+
+void doThing()
+{
+    float2 v = getValue(g_texture);
+}
+```
+
+Even `getValue` will receive t *without* the format attribute, and so will access it, presumably erroneously. A work around for this specific scenario would be to attribute the parameter
+
+```
+float2 getValue([format("rg16f")] RWTexture2D<float2> t)
+{
+    return t[int2(0, 0];
+}
+```
+
+This will only work correctly if `getValue` is called with a `t` that has that format attribute. As it stands no checking is performed on this matching so no error or warning will be produced if there is a mismatch. 
+
+There is limited software support for doing a conversion on reading. Currently this only supports only 1D, 2D, 3D RWTexture, backed with half1, half2 or half4. For this path to work NVRTC must have the `cuda_fp16.h` and associated files available. Please check the section on `Half Support`.
+
+If hardware read conversions are desired, this can be achieved by having a Texture<T> that uses the surface of a RWTexture<T>. Using the Texture<T> not only allows hardware conversion but also filtering. 
 
 It is also worth noting that CUsurfObjects in CUDA are NOT allowed to have mip maps. 
 
-By default surface access uses cudaBoundaryModeZero, this can be replaced using the macro SLANG_CUDA_BOUNDARY_MODE in the CUDA prelude.
+By default surface access uses cudaBoundaryModeZero, this can be replaced using the macro SLANG_CUDA_BOUNDARY_MODE in the CUDA prelude. For HW format conversions the macro SLANG_PTX_BOUNDARY_MODE. These boundary settings are in effect global for the whole of the kernel. 
+
+`SLANG_CUDA_BOUNDARY_MODE` can be one of
+
+* cudaBoundaryModeZero      causes an execution trap on out-of-bounds addresses
+* cudaBoundaryModeClamp     stores data at the nearest surface location (sized appropriately)
+* cudaBoundaryModeTrap      drops stores to out-of-bounds addresses 
+
+`SLANG_PTX_BOUNDARY_MODE` can be one of `trap`, `clamp` or `zero`. In general it is recommended to have both set to the same type of value, for example `cudaBoundaryModeZero` and `zero`.
 
 ## Sampler
 
 
@@ -1,7 +1,6 @@
 Slang Target Compatibility 
 ==========================
 
-
 Shader Model (SM) numbers are D3D Shader Model versions, unless explicitly stated otherwise.
 OpenGL compatibility is not listed here, because OpenGL isn't an officially supported target. 
 
@@ -203,8 +202,6 @@ uint64_t RWByteAddressBuffer::InterlockedMinU64(uint byteAddress, uint64_t value
 uint64_t RWByteAddressBuffer::InterlockedAndU64(uint byteAddress, uint64_t value);
 uint64_t RWByteAddressBuffer::InterlockedOrU64(uint byteAddress, uint64_t value);
 uint64_t RWByteAddressBuffer::InterlockedXorU64(uint byteAddress, uint64_t value);
-
-
 ```
 
 On HLSL based targets this functionality is achieved using [NVAPI](https://developer.nvidia.com/nvapi). Support for NVAPI is described
 
@@ -381,6 +381,41 @@ SLANG_SURFACE_WRITE(surf2DLayeredwrite, (int x, int y, int layer), (x, y, layer)
 SLANG_SURFACE_WRITE(surfCubemapwrite, (int x, int y, int face), (x, y, face))
 SLANG_SURFACE_WRITE(surfCubemapLayeredwrite, (int x, int y, int layerFace), (x, y, layerFace))
 
+// ! Hack to test out reading !!!
+// Only works converting *from* half 
+ 
+//template <typename T> 
+//SLANG_FORCE_INLINE SLANG_CUDA_CALL T surf2Dread_convert(cudaSurfaceObject_t surfObj, int x, int y, cudaSurfaceBoundaryMode boundaryMode);
+
+#define SLANG_SURFACE_READ_HALF_CONVERT(FUNC_NAME, TYPE_ARGS, ARGS) \
+\
+template <typename T>  \
+SLANG_FORCE_INLINE SLANG_CUDA_CALL T FUNC_NAME##_convert(cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode); \
+\
+template <> \
+SLANG_FORCE_INLINE SLANG_CUDA_CALL float FUNC_NAME##_convert<float>(cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode)  \
+{ \
+    return __ushort_as_half(FUNC_NAME<uint16_t>(surfObj, SLANG_DROP_PARENS ARGS, boundaryMode)); \
+} \
+\
+template <> \
+SLANG_FORCE_INLINE SLANG_CUDA_CALL float2 FUNC_NAME##_convert<float2>(cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode) \
+{ \
+    const __half2 v = __ushort_as_half(FUNC_NAME<ushort2>(surfObj, SLANG_DROP_PARENS ARGS, boundaryMode)); \
+    return float2{v.x, v.y}; \
+} \
+\
+template <> \
+SLANG_FORCE_INLINE SLANG_CUDA_CALL float4 FUNC_NAME##_convert<float4>(cudaSurfaceObject_t surfObj, SLANG_DROP_PARENS TYPE_ARGS, cudaSurfaceBoundaryMode boundaryMode) \
+{ \
+    const __half4 v = __ushort_as_half(FUNC_NAME<ushort4>(surfObj, SLANG_DROP_PARENS ARGS, boundaryMode)); \
+    return float4{v.xy.x, v.xy.y, v.zw.x, v.zw.y}; \
+}
+
+SLANG_SURFACE_READ_HALF_CONVERT(surf1Dread, (int x), (x)) 
+SLANG_SURFACE_READ_HALF_CONVERT(surf2Dread, (int x, int y), (x, y)) 
+SLANG_SURFACE_READ_HALF_CONVERT(surf3Dread, (int x, int y, int z), (x, y, z))
+
 #endif
 
 // Support for doing format conversion when writing to a surface/RWTexture
@@ -392,10 +427,14 @@ template <typename T>
 SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf1Dwrite_convert(T, cudaSurfaceObject_t surfObj, int x, cudaSurfaceBoundaryMode boundaryMode);
 template <typename T>
 SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf2Dwrite_convert(T, cudaSurfaceObject_t surfObj, int x, int y, cudaSurfaceBoundaryMode boundaryMode);
+template <typename T>
+SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf3Dwrite_convert(T, cudaSurfaceObject_t surfObj, int x, int y, int z, cudaSurfaceBoundaryMode boundaryMode);
 
 // https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html
 // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#surface-instructions-sust
 
+// Float
+
 template <>
 SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf1Dwrite_convert<float>(float v, cudaSurfaceObject_t surfObj, int x, cudaSurfaceBoundaryMode boundaryMode)
 {
@@ -408,6 +447,57 @@ SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf2Dwrite_convert<float>(float v, cuda
     asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2}], {%3};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"f"(v));
 }
 
+template <>
+SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf3Dwrite_convert<float>(float v, cudaSurfaceObject_t surfObj, int x, int y, int z, cudaSurfaceBoundaryMode boundaryMode)
+{
+    asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2,%3}], {%4};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"r"(z),"f"(v));
+}
+
+// Float2
+
+template <>
+SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf1Dwrite_convert<float2>(float2 v, cudaSurfaceObject_t surfObj, int x, cudaSurfaceBoundaryMode boundaryMode)
+{
+    const float vx = v.x, vy = v.y;
+    asm volatile ( "{sust.p.1d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1}], {%2,%3};}\n\t" :: "l"(surfObj),"r"(x),"f"(vx),"f"(vy));     
+}
+ 
+template <>
+SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf2Dwrite_convert<float2>(float2 v, cudaSurfaceObject_t surfObj, int x, int y, cudaSurfaceBoundaryMode boundaryMode)
+{
+    const float vx = v.x, vy = v.y;
+    asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2}], {%3,%4};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"f"(vx),"f"(vy));
+}
+
+template <>
+SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf3Dwrite_convert<float2>(float2 v, cudaSurfaceObject_t surfObj, int x, int y, int z, cudaSurfaceBoundaryMode boundaryMode)
+{
+    const float vx = v.x, vy = v.y;
+    asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2,%3}], {%4,%5};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"r"(z),"f"(vx),"f"(vy));
+}
+
+// Float4
+template <>
+SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf1Dwrite_convert<float4>(float4 v, cudaSurfaceObject_t surfObj, int x, cudaSurfaceBoundaryMode boundaryMode)
+{
+    const float vx = v.x, vy = v.y, vz = v.z, vw = v.w;
+    asm volatile ( "{sust.p.1d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1}], {%2,%3,%4,%5};}\n\t" :: "l"(surfObj),"r"(x),"f"(vx),"f"(vy),"f"(vz),"f"(vw));     
+}
+ 
+template <>
+SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf2Dwrite_convert<float4>(float4 v, cudaSurfaceObject_t surfObj, int x, int y, cudaSurfaceBoundaryMode boundaryMode)
+{
+    const float vx = v.x, vy = v.y, vz = v.z, vw = v.w;
+    asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2}], {%3,%4,%5,%6};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"f"(vx),"f"(vy),"f"(vz),"f"(vw));
+}
+
+template <>
+SLANG_FORCE_INLINE SLANG_CUDA_CALL void surf3Dwrite_convert<float4>(float4 v, cudaSurfaceObject_t surfObj, int x, int y, int z, cudaSurfaceBoundaryMode boundaryMode)
+{
+    const float vx = v.x, vy = v.y, vz = v.z, vw = v.w;
+    asm volatile ( "{sust.p.2d.b32." SLANG_PTX_BOUNDARY_MODE " [%0, {%1,%2,%3}], {%4,%5,%6,%7};}\n\t" :: "l"(surfObj),"r"(x),"r"(y),"r"(z),"f"(vx),"f"(vy),"f"(vz),"f"(vw));
+}
+
 // ----------------------------- F32 -----------------------------------------
 
 // Unary 
 
@@ -1083,7 +1083,7 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)
                             }
 
                             sb << (isArray ? "Layered" : "");
-                            sb << "read<$T0>($0";
+                            sb << "read$C<$T0>($0";
 
                             for (int i = 0; i < vecCount; ++i)
                             {
 
@@ -1,6 +1,8 @@
 // slang-intrinsic-expand.cpp
 #include "slang-intrinsic-expand.h"
 
+#include "slang-emit-cuda.h"
+
 namespace Slang {
 
 void IntrinsicExpandContext::emit(IRCall* inst, IRUse* args, Int argCount, const UnownedStringSlice& intrinsicText)
@@ -101,13 +103,13 @@ static BaseType _getBaseTypeFromScalarType(SlangScalarType type)
 // The VK back-end gets away with this kind of coincidentally, since the "legalization" we have to do for resources means that there wouldn't be a single f() function any more.
 // But for CUDA and C++ that's not the case or generally desirable.
 
-static IRFormatDecoration* _findImageFormatDecoration(IRInst* inst)
+static IRFormatDecoration* _findImageFormatDecoration(IRInst* resourceInst)
 {
     // JS(TODO):
     // There could perhaps be other situations, that need to be covered
 
     // If this is a load, we need to get the decoration from the field key
-    if (IRLoad* load = as<IRLoad>(inst))
+    if (IRLoad* load = as<IRLoad>(resourceInst))
     {
         if (IRFieldAddress* fieldAddress = as<IRFieldAddress>(load->getOperand(0)))
         {
@@ -116,7 +118,7 @@ static IRFormatDecoration* _findImageFormatDecoration(IRInst* inst)
         }
     }
     // Otherwise just try on the instruction
-    return inst->findDecoration<IRFormatDecoration>();
+    return resourceInst->findDecoration<IRFormatDecoration>();
 }
 
 // Returns true if dataType and imageFormat are compatible - that they have the same representation,
@@ -149,36 +151,26 @@ static bool _isImageFormatCompatible(ImageFormat imageFormat, IRType* dataType)
     return formatBaseType == baseType;
 }
 
-static bool _isConvertRequired(ImageFormat imageFormat, IRInst* resourceVar)
+static bool _isConvertRequired(ImageFormat imageFormat, IRInst* callee)
 {
-    auto textureType = as<IRTextureTypeBase>(resourceVar->getDataType());
+    auto textureType = as<IRTextureTypeBase>(callee->getDataType());
     IRType* elementType = textureType ? textureType->getElementType() : nullptr;
     return elementType && !_isImageFormatCompatible(imageFormat, elementType);
 }
 
-static size_t _calcBackingElementSizeInBytes(IRInst* resourceVar)
+static size_t _calcBackingElementSizeInBytes(IRInst* resourceInst)
 {
     // First see if there is a format associated with the resource
-    if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(resourceVar))
+    if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(resourceInst))
     {
-        const ImageFormat imageFormat = formatDecoration->getFormat();
-
-        if (_isConvertRequired(imageFormat, resourceVar))
-        {
-            // If the access is a converting access then the x coordinate is *NOT* scaled
-            // This is a CUDA specific issue(!).
-            return 1;
-        }
-
-        const auto& imageFormatInfo = getImageFormatInfo(imageFormat);
-        return imageFormatInfo.sizeInBytes;
+        return getImageFormatInfo(formatDecoration->getFormat()).sizeInBytes;
     }
     else
     {
         // If not we *assume* the backing format is the same as the element type used for access.
         /// Ie in RWTexture<T>, this would return sizeof(T)
 
-        auto textureType = as<IRTextureTypeBase>(resourceVar->getDataType());
+        auto textureType = as<IRTextureTypeBase>(resourceInst->getDataType());
         IRType* elementType = textureType ? textureType->getElementType() : nullptr;
 
         if (elementType)
@@ -206,6 +198,18 @@ static size_t _calcBackingElementSizeInBytes(IRInst* resourceVar)
     return 4;
 }
 
+static bool _isResourceRead(IRCall* call)
+{
+    IRType* returnType = call->getDataType();
+    return returnType && (as<IRVoidType>(returnType) == nullptr);
+}
+
+static bool _isResourceWrite(IRCall* call)
+{
+    IRType* returnType = call->getDataType();
+    return returnType && (as<IRVoidType>(returnType) != nullptr);
+}
+
 const char* IntrinsicExpandContext::_emitSpecial(const char* cursor)
 {
     const char*const end = m_text.end();
@@ -323,13 +327,35 @@ const char* IntrinsicExpandContext::_emitSpecial(const char* cursor)
             // writes that will do a format conversion.
             if (m_emitter->getTarget() == CodeGenTarget::CUDASource)
             {
-                IRInst* arg0 = m_callInst->getArg(0);
+                IRInst* resourceInst = m_callInst->getArg(0);
 
-                if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(arg0))
+                if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(resourceInst))
                 {
                     const ImageFormat imageFormat = formatDecoration->getFormat();
-                    if (_isConvertRequired(imageFormat, arg0))
+                    if (_isConvertRequired(imageFormat, resourceInst))
                     {
+                        // If the function returns something it's a reader so we may need to convert
+                        // and in doing so require half
+                        if (_isResourceRead(m_callInst))
+                        {
+                            // If the source format if half derived, then we need to enable half
+                            switch (imageFormat)
+                            {
+                                case ImageFormat::r16f:
+                                case ImageFormat::rg16f:
+                                case ImageFormat::rgba16f:
+                                {
+                                    CUDAExtensionTracker* extensionTracker = as<CUDAExtensionTracker>(m_emitter->getExtensionTracker());
+                                    if (extensionTracker)
+                                    {
+                                        extensionTracker->requireBaseType(BaseType::Half);
+                                    }
+                                    break;
+                                }
+                                default: break;
+                            }
+                        }
+
                         // Append _convert on the name to signify we need to use a code path, that will automatically
                         // do the format conversion.
                         m_writer->emit("_convert");
@@ -344,7 +370,21 @@ const char* IntrinsicExpandContext::_emitSpecial(const char* cursor)
             /// Sometimes accesses need to be scaled. For example in CUDA the x coordinate for surface
             /// access is byte addressed.
             /// $E will return the byte size of the *backing element*.
-            size_t elemSizeInBytes = _calcBackingElementSizeInBytes(m_callInst->getArg(0));
+
+            IRInst* resourceInst = m_callInst->getArg(0);
+            size_t elemSizeInBytes = _calcBackingElementSizeInBytes(resourceInst);
+
+            // If we have a format converstion and its a *write* we don't need to scale
+            if (IRFormatDecoration* formatDecoration = _findImageFormatDecoration(resourceInst))
+            {
+                const ImageFormat imageFormat = formatDecoration->getFormat();
+                if (_isConvertRequired(imageFormat, resourceInst) && _isResourceWrite(m_callInst))
+                {
+                    // If there is a conversion *and* it's a write we don't need to scale.
+                    elemSizeInBytes = 1;
+                }
+            }
+
             SLANG_ASSERT(elemSizeInBytes > 0);
             m_writer->emitUInt64(UInt64(elemSizeInBytes));
             break;
Original file line number	Diff line number	Diff line change
`@@ -1083,7 +1083,7 @@ for (int tt = 0; tt < kBaseTextureTypeCount; ++tt)`
`1083`	`1083`	`}`
`1084`	`1084`
`1085`	`1085`	`sb << (isArray ? "Layered" : "");`
`1086`		`- sb << "read<$T0>($0";`
	`1086`	`+ sb << "read$C<$T0>($0";`
`1087`	`1087`
`1088`	`1088`	`for (int i = 0; i < vecCount; ++i)`
`1089`	`1089`	`{`