perf: [draft] Add fused q_norm/k_norm/RoPE for Qwen3. #4482

bobboli · 2025-05-20T08:14:13Z

PR title

Please write the PR title using the following template:

[JIRA ticket link/nvbug link/github issue link][fix/feat/doc/infra/...] <summary of this PR>

For example, a PR that adds a new cache manager feature tracked by Jira ticket TRTLLM-1000 would be titled:

[TRTLLM-1000][feat] Support a new feature about cache manager

Description

Please briefly explain the issue and the solution.

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

In this case, numElemsPerThread=2 and numVecPerThread=0, but the store code incorrectly performs a vectorized store, so some threads (e.g., lane 1) issue a store to an address that is not 64-bit aligned. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
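The alignment problem described in this commit message can be illustrated with a small host-side sketch. It is not code from the PR: the helper names are hypothetical, and it assumes 16-bit elements (e.g. bf16) and a head base pointer that is itself 64-bit aligned. Under those assumptions each lane owns 4 bytes, so every odd lane starts halfway through an 8-byte word.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset, relative to the start of a head, of the slice owned by `lane`.
// Hypothetical helper for illustration; not taken from the PR.
constexpr std::size_t laneByteOffset(int lane, int numElemsPerThread, std::size_t elemBytes)
{
    return static_cast<std::size_t>(lane) * numElemsPerThread * elemBytes;
}

// True if a store of `vecBytes` at the lane's offset is naturally aligned,
// assuming the head base pointer is aligned to `vecBytes`.
constexpr bool vectorStoreIsAligned(
    int lane, int numElemsPerThread, std::size_t elemBytes, std::size_t vecBytes)
{
    return laneByteOffset(lane, numElemsPerThread, elemBytes) % vecBytes == 0;
}

// numElemsPerThread = 2 comes from the commit message; elemBytes = 2 is an assumption.
static_assert(vectorStoreIsAligned(0, 2, 2, 8), "lane 0 starts at byte 0");
static_assert(!vectorStoreIsAligned(1, 2, 2, 8), "lane 1 starts at byte 4, not 64-bit aligned");
static_assert(vectorStoreIsAligned(2, 2, 2, 8), "lane 2 starts at byte 8");
```

With only two elements per thread there is no full 64-bit vector to store, which is consistent with numVecPerThread=0; the remaining elements would need narrower stores.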

Cleanup code. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

byshiue · 2025-05-20T08:37:20Z

cpp/tensorrt_llm/kernels/fusedQKNormRopeKernel.cu

+    }
+
+    // Reduce sum across warp
+    for (int mask = 16; mask > 0; mask /= 2)


This can use warpReduceSum from cpp/tensorrt_llm/common/reduceKernelUtils.cuh directly.
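For readers unfamiliar with the pattern, the loop quoted in the diff (mask = 16, 8, 4, 2, 1) is an XOR-butterfly reduction across the 32 lanes of a warp. The sketch below emulates it on the host in plain C++ so it can be run without a GPU. The function name is hypothetical and this is not the implementation in reduceKernelUtils.cuh; on the device, each step would typically be a `__shfl_xor_sync` exchange.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;

// Host-side emulation of an XOR-butterfly warp sum (illustrative only).
// On the device each step would be roughly:
//   val += __shfl_xor_sync(0xffffffff, val, mask);
inline std::array<float, kWarpSize> emulateWarpReduceSum(std::array<float, kWarpSize> lanes)
{
    for (int mask = kWarpSize / 2; mask > 0; mask /= 2)
    {
        // All lanes read their partner's old value before anyone writes,
        // so compute into a copy and swap it in afterwards.
        std::array<float, kWarpSize> next = lanes;
        for (int lane = 0; lane < kWarpSize; ++lane)
        {
            next[lane] = lanes[lane] + lanes[lane ^ mask];
        }
        lanes = next;
    }
    return lanes; // every lane now holds the full sum
}
```

After the five steps every lane holds the same total, which is why a shared helper can replace the hand-written loop without changing what each thread sees.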


bobboli added 10 commits May 16, 2025 03:17

Add Julien's original kernel.

0294545

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

Get rid of UpdateKVCache functionality.

40fbbe2

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

Add kernels.

befe1aa

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

Add torch OP.

b76bb24

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

Update cmake.

f76de38

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

Torch OP must use double as argument dtype.

91ff8a0

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

Add unittest.

e461cf7

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

Add unittest.

39df73b

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

Remove unroll (compiler can do that).

756174e

Cleanup code. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

bobboli requested review from byshiue and jdemouth-nvidia May 20, 2025 08:14

byshiue reviewed May 20, 2025


bobboli changed the title ~~[draft][perf] Add fused q_norm/k_norm/RoPE for Qwen3.~~ perf: [draft] Add fused q_norm/k_norm/RoPE for Qwen3. May 20, 2025

byshiue approved these changes May 20, 2025


Add switch for interleave.

18d665b

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
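As background for the interleave switch in the commit above: RoPE rotates the elements of a head in pairs, and the two common layouts differ only in which elements form a pair. The sketch below shows that index mapping. It assumes "interleave" means adjacent pairs (2i, 2i+1) and that the alternative is the rotate-half layout (i, i + headDim/2); the PR page does not spell this out, and the function names are hypothetical.

```cpp
#include <cassert>

// Index of the element that dimension `i` is rotated together with.
// interleave == true  : adjacent pairs (0,1), (2,3), ...
// interleave == false : rotate-half pairs (0, headDim/2), (1, headDim/2 + 1), ...
// Assumes headDim is even. Illustrative only; not taken from the PR.
constexpr int ropePartner(int i, int headDim, bool interleave)
{
    return interleave ? (i ^ 1) : (i + headDim / 2) % headDim;
}

// Index of the rotation frequency shared by the pair that contains dimension `i`.
constexpr int ropeFreqIndex(int i, int headDim, bool interleave)
{
    return interleave ? i / 2 : i % (headDim / 2);
}

static_assert(ropePartner(0, 8, true) == 1, "adjacent pairing");
static_assert(ropePartner(0, 8, false) == 4, "rotate-half pairing");
```

In both layouts each pair shares one frequency, so a kernel can support either by switching only how it computes the partner and frequency indices.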
