Hi, thanks for your excellent work and the open-sourced code, I really appreciate it! However, I'm having trouble understanding block-wise attention transfer. As stated in your paper, the block-wise approach is proposed to improve linearization quality at larger model scales, since the MSE losses of the later layers would otherwise dominate training in the joint setting.
As I understand it, the input hidden states for each layer come from the outputs of the "softmax branch" of the previous layer, which presumably require no grad. As a result, the backward gradient of one layer's attention transfer loss cannot affect earlier layers. That is, the q/k_process modules of each layer are already trained independently even in the joint form.
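For concreteness, here is a minimal PyTorch sketch of what I have in mind. The module names (`FeatureMap`, the use of `nn.MultiheadAttention` as a stand-in for the frozen softmax branch, the unnormalized `linear_attn`) are just placeholders I made up, not your actual implementation:

```python
# Sketch of my reading of "joint" attention transfer. Because the hidden
# states fed to layer l come from the frozen softmax branch of layer l-1
# (computed under no_grad), layer l's MSE term only reaches layer l's own
# feature-map parameters, so the layers look independently trained even
# when the per-layer losses are summed jointly.
import torch
import torch.nn as nn

class FeatureMap(nn.Module):
    """Hypothetical learnable q/k feature map (stand-in for q/k_process)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.proj(x))

def linear_attn(q, k, v):
    # non-causal, unnormalized linear attention, for illustration only
    kv = torch.einsum("bnd,bne->bde", k, v)
    return torch.einsum("bnd,bde->bne", q, kv)

dim, n_layers = 16, 4
feature_maps = [FeatureMap(dim) for _ in range(n_layers)]
teacher_attn = [nn.MultiheadAttention(dim, 1, batch_first=True) for _ in range(n_layers)]

x = torch.randn(2, 8, dim)            # (batch, seq, dim)
hidden = x
total_loss = 0.0
for fm, attn in zip(feature_maps, teacher_attn):
    # Frozen softmax branch produces both the transfer target and the
    # next layer's input -> detached, so no gradient crosses layers.
    with torch.no_grad():
        y_softmax, _ = attn(hidden, hidden, hidden)
    q, k = fm(hidden), fm(hidden)     # only this layer's params require grad
    y_linear = linear_attn(q, k, hidden)
    total_loss = total_loss + torch.mean((y_linear - y_softmax) ** 2)
    hidden = y_softmax                # next layer's input carries no grad history

total_loss.backward()
# Each feature_maps[l].proj.weight.grad comes only from layer l's own MSE
# term, which is why I don't see how block-wise differs from joint here.
```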
There's a good chance many of my statements above are wrong, and I'd really appreciate it if you could clear up my confusion.