
Have trouble understanding block-wise attention transfer. #10

Open

xUhEngwAng opened this issue Nov 28, 2024 · 1 comment

Comments

@xUhEngwAng

Hi, thanks for your excellent work and the open-sourced code, really appreciate it! However, I'm having trouble understanding block-wise attention transfer. As stated in your paper, the block-wise approach is proposed to improve linearization quality at larger model sizes, because in the joint setting the MSE losses of the later layers would dominate the training process.
As far as I understand, the input hidden states of each layer come from the outputs of the previous layer's "softmax branch", which presumably require no grad. As a result, the backward gradient of one layer's attention-transfer loss cannot affect the preceding layers. That is, the q/k_process modules of each layer are already trained independently, even in the joint form.
There is a good chance that many of my statements above are wrong, and I would really appreciate it if you could clarify my concerns.
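
For concreteness, here is a minimal sketch of how I currently picture the joint setting (purely illustrative; `softmax_attn`, `linear_attn`, and the layer interface are placeholder names, not the actual repo code):

```python
import torch
import torch.nn.functional as F

def joint_attention_transfer_step(layers, hidden_states):
    """Illustrative sketch of the joint attention-transfer setting.

    The next layer's input is taken from the softmax ("teacher") branch and
    carries no grad, so the gradient of layer i's loss never reaches layers < i.
    """
    total_loss = hidden_states.new_zeros(())
    for layer in layers:
        # Teacher branch: the original softmax attention, frozen / no grad.
        with torch.no_grad():
            softmax_out = layer.softmax_attn(hidden_states)

        # Student branch: learnable linear attention (trainable q/k feature maps).
        linear_out = layer.linear_attn(hidden_states)

        # Per-layer attention-transfer (MSE) loss.
        total_loss = total_loss + F.mse_loss(linear_out, softmax_out)

        # The next layer sees the softmax branch's output, which requires no grad,
        # so backward() below cannot influence earlier layers' parameters.
        hidden_states = softmax_out

    total_loss.backward()
    return total_loss
```

If this picture is right, `total_loss.backward()` only ever updates each layer's own linear-attention parameters, even though the losses are summed jointly.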

@simran-arora
Contributor

Hi! Yes, each layer is trained independently! This is why we have a post-LoRA stage to "stitch the layers back together".
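
A minimal sketch of the kind of LoRA adapter used in that stitching stage (a hypothetical, generic PyTorch `LoRALinear`, not the repository's implementation). The key contrast with the per-layer transfer stage is that nothing is detached here: the usual next-token loss backpropagates through every linearized layer and updates the low-rank adapters jointly, which is what "stitches the layers back together".

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: a frozen base projection plus a trainable
    low-rank update of rank r. Shown only to make the post-transfer stage
    concrete, where gradients flow end to end across all layers."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.02)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T  # frozen output + low-rank correction
```

Because `B` starts at zero, the wrapped model initially behaves exactly like the independently trained linearized layers, and the adapters then learn whatever cross-layer corrections the end-to-end loss demands.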
