
Slang takes ~30x as long as shaderc to compile a simple compute shader #6358

Closed
NBickford-NV opened this issue Feb 14, 2025 · 10 comments
Labels
goal:quality & productivity Quality issues and issues that impact our productivity coding day to day inside slang


@NBickford-NV
Contributor

NBickford-NV commented Feb 14, 2025

Hi Slang team! I've been running into some issues affecting hot-reload workflows, where re-compiling small shaders is common.

The ToT version of Slang (as of 944c19b) takes 48-49 ms on my Windows computer to compile the following 841 bytes of source to SPIR-V. This imports no modules, does not include Slang global session creation time or I/O time, uses optimization level 0, and is averaged over 128 runs. I've included a benchmark you can use to reproduce this issue; more information about it below.

// shader.slang
struct PushConstantCompute
{
  uint64_t bufferAddress;
  uint     numVertices;
};

struct Vertex
{
  float3 position;
};

[[vk::push_constant]]
ConstantBuffer<PushConstantCompute> pushConst;

[shader("compute")]
[numthreads(256, 1, 1)]
void main(uint3 threadIdx : SV_DispatchThreadID)
{
  uint index = threadIdx.x;

  if(index >= pushConst.numVertices)
    return;

  Vertex* vertices = (Vertex*)pushConst.bufferAddress;

  float angle = (index + 1) * 2.3f;

  float3 vertex = vertices[index].position;

  float cosAngle = cos(angle);
  float sinAngle = sin(angle);
  float3x3 rotationMatrix = float3x3(
    cosAngle, -sinAngle, 0.0,
    sinAngle,  cosAngle, 0.0,
         0.0,       0.0, 1.0
  );

  float3 rotatedVertex = mul(rotationMatrix, vertex);

  vertices[index].position = rotatedVertex;
}

The options and targets used in my benchmark are:

    m_options = {{slang::CompilerOptionName::EmitSpirvDirectly, {slang::CompilerOptionValueKind::Int, 1}},        //
                 {slang::CompilerOptionName::VulkanUseEntryPointName, {slang::CompilerOptionValueKind::Int, 1}},  //
                 {slang::CompilerOptionName::Optimization, {slang::CompilerOptionValueKind::Int, 0}}};
    m_targets = {slang::TargetDesc{.format = SLANG_SPIRV, .profile = m_globalSession->findProfile("spirv_1_6")}};

In comparison, shaderc (using shaderc_shared from the 1.4.304.0 Vulkan SDK) compiles the GLSL equivalent in ~1.6 ms, about 30 times as quickly:

// shader.comp.glsl
#version 460
#extension GL_EXT_buffer_reference2 : require
#extension GL_EXT_scalar_block_layout : require
#extension GL_EXT_shader_explicit_arithmetic_types : require

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

struct PushConstantCompute
{
  uint64_t bufferAddress;
  uint     numVertices;
};

layout(push_constant, scalar) uniform PushConsts
{
  PushConstantCompute pushConst;
};

struct Vertex
{
  vec3 position;
};

layout(buffer_reference, scalar) buffer VertexBuffer
{
  Vertex data[];
};

void main()
{
  uint index = gl_GlobalInvocationID.x;

  if(index >= pushConst.numVertices)
    return;

  VertexBuffer vertices = VertexBuffer(pushConst.bufferAddress);

  float angle = (index + 1) * 2.3f;

  vec3 vertex = vertices.data[index].position;

  float cosAngle = cos(angle);
  float sinAngle = sin(angle);
  mat3 rotationMatrix = mat3(
    cosAngle, -sinAngle, 0.0,
    sinAngle,  cosAngle, 0.0,
         0.0,       0.0, 1.0
  );

  vec3 rotatedVertex = rotationMatrix * vertex;

  vertices.data[index].position = rotatedVertex;
}

The generated SPIR-V files are similar, although shaderc's is slightly larger.

I've put together a benchmark at https://github.com/NBickford-NV/slang-compile-timer to test this under controlled conditions. It first initializes each shader compiler, then times how long it takes to compile a shader to SPIR-V 128 times and averages the results. (Varying the number of repetitions doesn't change the result much, so the first compilation isn't significantly more expensive.)

To build the benchmark (currently only tested on Windows), run:

git clone --recursive https://github.com/NBickford-NV/slang-compile-timer.git
cd slang-compile-timer
mkdir cmake_build
cd cmake_build
cmake ..
cmake --build . --parallel

I've included a Release binary compiled using Visual Studio 2022 17.12.3.

Then to benchmark Slang, run ./slang-compile-timer shader.slang:

Loaded shader.slang; 841 bytes.
Compiler initialization time: 262.936300 ms
Compiling 128 times...
Repetition 1
Repetition 2
Repetition 4
Repetition 8
Repetition 16
Repetition 32
Repetition 64
Repetition 128
Average compilation time: 48.232041 ms
SPIR-V output is 1512 bytes long.

And to benchmark shaderc, run ./slang-compile-timer --shaderc shader.comp.glsl:

Loaded shader.comp.glsl; 1117 bytes.
Compiler initialization time: 0.060100 ms
Compiling 128 times...
Repetition 1
Repetition 2
Repetition 4
Repetition 8
Repetition 16
Repetition 32
Repetition 64
Repetition 128
Average compilation time: 1.542900 ms
SPIR-V output is 2568 bytes long.

Thank you!

package.zip

@csyonghe
Collaborator

Thanks for providing this benchmark, we will look into this issue.

@csyonghe csyonghe self-assigned this Feb 14, 2025
@csyonghe csyonghe added this to the Q1 2025 (Winter) milestone Feb 14, 2025
@csyonghe csyonghe added the goal:quality & productivity Quality issues and issues that impact our productivity coding day to day inside slang label Feb 14, 2025
@csyonghe
Collaborator

I was able to improve performance in this benchmark quite a bit in #6396, but I should also note that Slang will never be as fast as a GLSL compiler, similar to how a C++ compiler can never be as fast as a C compiler, due to the more powerful type system and more flexible compiler architecture.

There might still be one or two things we can do from here to get another 20-30% speedup, but it is unlikely we can get much better performance on small examples like these.

Note that there are a lot of components in the compiler that incur a flat cost at the beginning, which is meant to be amortized when compiling larger code. The advanced type system in Slang often allows users to write more compact, generic code, all of which helps with compile time when handling more complex application code.

In particular, Slang allows you to precompile modules into .slang-module files, so you never need to reparse the same module twice. In this example, if you first convert the .slang file to a .slang-module file and then generate code from there, you will be able to bypass the front end entirely and get much shorter compile times.

@NBickford-NV
Contributor Author

Thank you @csyonghe! I'll build #6396 and verify the performance improvement.

I had some ideas for further optimizations (e.g. I saw malloc/free and dynamic_cast relatively high on the list when I ran a performance profile, so I was thinking about looking at mimalloc and at the assembly MSVC generates for dynamic_cast), but I'll need to re-run the profile to see if that's still the case.

Just to check my understanding, the Slang module system wouldn't help for small shaders like this, right?

@csyonghe
Collaborator

If compile time is a concern, the idea is to always precompile all .slang files into Slang modules before your application starts, and then use link-time specialization instead of preprocessor-based specialization to remove type checking from the application's runtime entirely.

https://shader-slang.org/slang/user-guide/link-time-specialization.html

@NBickford-NV
Contributor Author

Thank you @csyonghe! I now get an average compilation time of 14-16 ms on bca772c, ~3x as fast as before. That's a good performance improvement!

In case it helps, the flame graph I get for this looks like this:

[flame graph image]

And the bottom-up view of time spent inside each function, excluding subfunctions:

[bottom-up profile image]

@NBickford-NV
Contributor Author

I took a quick look into dynamicCast -- the codegen there looks OK (although moving IRInst::getOperands() into the header, since LTO didn't inline it, improves performance by about 4%, from 14.5 ms to 13.8 ms on my system). The main issue is that dynamicCast is called a lot -- 275041 times per compilation. 73125 (26%) of these return at the nullptr check; 62700 (23%) return because T::isaImpl() succeeds; all but 88 (0.03%) of the rest return at the final nullptr, although the unwrap is checked 137244 times. So the biggest performance improvements will probably be algorithmic.

@csyonghe
Collaborator

csyonghe commented Mar 1, 2025

Yes, this is consistent with the profiling results I've been seeing; there is no obvious bottleneck in the system.

I am not seeing any low-hanging fruit here that would significantly improve performance from the current state.

Since it is not clear that there are any actionable items, would you mind if we close this issue? I think users can avoid paying the front-end cost anyway if they architect their code to use Slang modules.

@NBickford-NV
Contributor Author

Sure, this can be closed; I feel like there's probably more to find here (should compiling a file this small require 275K dynamic casts?), but this speedup is good to see. I'll create another issue for the larger-scale benchmark testing out modules if I find issues there. Thanks!

@csyonghe
Collaborator

csyonghe commented Mar 1, 2025

I agree that there are definitely more optimizations to be had, but it is unlikely that a single change or fix is going to make things significantly different, and the ROI will be diminishing.

Just to show why the type system is complex, here is an example of what needs to happen when checking a simple x+1 expression:

  1. Check the type of x, and find it to be float.
  2. Look up all overloads of +; there are 30+ overloads of operator+ in the standard library.
  3. For generic operator+ candidates, check whether the generic can be specialized with the operand types. This means building the type inheritance list of all argument types, a complicated process that needs to take into account all potential extensions that may apply to float.
  4. After all that is done, compute the type coercion cost for each candidate and pick the best option.

Step 3 there can spawn a lot of additional checks, because Slang allows things like:

extension<T: IInterface1> T : IInterface2 {}

That makes every type that conforms to IInterface1 also conform to IInterface2, so if we have something like:

T operator+<T:IInterface2>(T, T)

then to know whether this candidate is applicable, we not only need to know whether an argument type conforms to IInterface1, but also whether any extensions exist that make it conform to IInterface2, and so on.

Compared to this, the checking step in GLSL, which has no generics, is much simpler.

There are certainly more algorithmic optimizations we can do to make things faster; in fact, Slang already uses caches to hold checked subtype relationships and operator overload resolution results, but there are more opportunities for optimization.

But this should give you an idea of the complexity of the type system, and hopefully it explains why Slang's type checking is going to take more time than a GLSL compiler's.

@csyonghe
Collaborator

csyonghe commented Mar 1, 2025

I am going to close the issue now, but I am happy to work with anyone who is interested in optimizing the compiler to see if there is more we can do here.

@csyonghe csyonghe closed this as completed Mar 1, 2025