[MXFP8] grad_output is quantized columnwise even if weight doesn't require gradients. #1693

Open
kshitij12345 opened this issue Apr 17, 2025 · 2 comments · May be fixed by #1736
Labels
bug Something isn't working

Comments

@kshitij12345

https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Handling-transposes

Based on the diagram in the linked section ([image: forward/backward GEMM dataflow from the FP8 primer]):

If the weight is frozen (e.g., in a LoRA setting), we can avoid quantizing grad_output in the column-wise direction. The column-wise (transposed) copy of grad_output is only consumed by the wgrad GEMM, while dgrad only needs the row-wise copy, so when the weight does not require gradients the column-wise quantization is wasted compute and memory.
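
For reference, a minimal plain-PyTorch sketch (not TE's actual kernel path) of the two backward GEMMs of a linear layer, showing that only the wgrad GEMM consumes grad_output in transposed (column-wise) form:

import torch

# Backward of y = x @ W.T (torch.nn.Linear convention).
x = torch.randn(8, 16)            # input  [batch, in_features]
W = torch.randn(32, 16)           # weight [out_features, in_features]
grad_output = torch.randn(8, 32)  # dL/dy  [batch, out_features]

dgrad = grad_output @ W    # dL/dx: consumes grad_output as-is (row-wise copy in MXFP8 terms)
wgrad = grad_output.T @ x  # dL/dW: consumes grad_output.T (column-wise copy in MXFP8 terms)

If W.requires_grad is False, the wgrad line is never executed, so the column-wise quantized copy of grad_output is never needed.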

The following patch seems to work:

diff --git a/transformer_engine/pytorch/module/linear.py b/transformer_engine/pytorch/module/linear.py
index 83dc652c..3e58700a 100644
--- a/transformer_engine/pytorch/module/linear.py
+++ b/transformer_engine/pytorch/module/linear.py
@@ -423,6 +423,11 @@ class _Linear(torch.autograd.Function):
                     ub_obj_wgrad.set_buffer_params(ctx.grad_input_quantizer)
                     dgrad_bulk = ub_obj_wgrad.get_buffer(ctx.grad_input_quantizer)
 
+            if not ctx.requires_dgrad and ctx.grad_output_quantizer is not None:
+                ctx.grad_output_quantizer.set_usage(rowwise=False)
+            if not ctx.requires_wgrad and ctx.grad_output_quantizer is not None:
+                ctx.grad_output_quantizer.set_usage(columnwise=False)
+
             # Prepare grad output tensor
             # Note: Cast to expected dtype and perform tensor-parallel communication
             if ctx.grad_output_quantizer is not None:
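
As far as I can tell, set_usage(rowwise=..., columnwise=...) tells the quantizer which layouts of the quantized tensor to produce, and ctx.requires_dgrad / ctx.requires_wgrad are the flags the backward already tracks for whether the input and weight gradients are needed, so the patch simply stops producing the layout that the skipped GEMM would have consumed.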

Test Script

import torch

import transformer_engine
from transformer_engine.pytorch import fp8_autocast, Linear

dim = 1024 * 22  # Large input for demonstration of memory change.
linear = Linear(dim, dim, bias=False)
x = torch.randn(dim, dim, requires_grad=True, device="cuda")

linear.weight.requires_grad = False

with fp8_autocast():
    o = linear(x)
    g_o = torch.randn_like(o)

o.backward(g_o)

# Without patch - 12314.476544 MB
# With patch - 11790.188544 MB
print(torch.cuda.max_memory_allocated() / 1e6, "MB")
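
The ~524 MB saving roughly matches one FP8 copy of grad_output plus its MXFP8 scale factors (22528² + 22528²/32 bytes ≈ 523 MB), modulo allocator rounding. As a quick extra check (not part of the measurement above), one can also confirm that dgrad is still produced while the frozen weight accumulates no wgrad:

# Runs after the backward pass above.
assert x.grad is not None and x.grad.shape == x.shape  # dgrad is still computed
assert linear.weight.grad is None                       # frozen weight gets no wgrad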
kshitij12345 added the bug label on Apr 17, 2025
@ptrendx (Member) commented Apr 25, 2025

Hi @kshitij12345, this makes perfect sense. The proposed solution looks good to me; could you create a PR with it?

@kshitij12345 (Author) commented

Sure, I will have a PR up soon, thanks!
