
Have trouble understanding block-wise attention transfer. #10

Open

xUhEngwAng opened this issue Nov 28, 2024 · 1 comment

Comments

@xUhEngwAng

Hi, thanks for your excellent work and the open-sourced code, really appreciate it! However, I'm having trouble understanding block-wise attention transfer. As stated in your paper, the block-wise approach is proposed to improve linearization quality at larger model sizes, because in the joint setting the MSE losses of the later layers would dominate the training process.
As far as I understand, the input hidden states of each layer come from the outputs of the previous layer's "softmax branch", which presumably require no grad. As a result, the backward gradient of one layer's attention-transfer loss cannot affect the preceding layers. That is, the q/k_process modules of each layer are already trained independently, even in the joint form.
There is a good chance that many of my statements above are wrong, and I would really appreciate it if you could clarify my concerns.
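
For concreteness, here is a minimal sketch of how I currently picture the joint setting (purely illustrative; `softmax_attn`, `linear_attn`, and the layer interface are placeholder names, not the actual repo code):

```python
import torch
import torch.nn.functional as F

def joint_attention_transfer_step(layers, hidden_states):
    """Illustrative sketch of the joint attention-transfer setting.

    The next layer's input is taken from the softmax ("teacher") branch and
    carries no grad, so the gradient of layer i's loss never reaches layers < i.
    """
    total_loss = hidden_states.new_zeros(())
    for layer in layers:
        # Teacher branch: the original softmax attention, frozen / no grad.
        with torch.no_grad():
            softmax_out = layer.softmax_attn(hidden_states)

        # Student branch: learnable linear attention (trainable q/k feature maps).
        linear_out = layer.linear_attn(hidden_states)

        # Per-layer attention-transfer (MSE) loss.
        total_loss = total_loss + F.mse_loss(linear_out, softmax_out)

        # The next layer sees the softmax branch's output, which requires no grad,
        # so backward() below cannot influence earlier layers' parameters.
        hidden_states = softmax_out

    total_loss.backward()
    return total_loss
```

If this picture is right, `total_loss.backward()` only ever updates each layer's own linear-attention parameters, even though the losses are summed jointly.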

@simran-arora
Contributor

Hi! Yes, each layer is trained independently! This is why we have a post-LoRA stage to "stitch the layers back together".
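
A minimal sketch of the kind of LoRA adapter used in that stitching stage (a hypothetical, generic PyTorch `LoRALinear`, not the repository's implementation). The key contrast with the per-layer transfer stage is that nothing is detached here: the usual next-token loss backpropagates through every linearized layer and updates the low-rank adapters jointly, which is what "stitches the layers back together".

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: a frozen base projection plus a trainable
    low-rank update of rank r. Shown only to make the post-transfer stage
    concrete, where gradients flow end to end across all layers."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.02)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T  # frozen output + low-rank correction
```

Because `B` starts at zero, the wrapped model initially behaves exactly like the independently trained linearized layers, and the adapters then learn whatever cross-layer corrections the end-to-end loss demands.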
