Add attention drop #6

Open · wants to merge 2 commits into `master`

1 change in README.md: 1 addition & 0 deletions

@@ -115,6 +115,7 @@ An awesome style list that curates the best machine learning model compression a
> They propose `SparseGPT`, the first accurate one-shot pruning method that works efficiently at the scale of models with 10-100 billion parameters. `SparseGPT` works by reducing the pruning problem to an extremely large-scale instance of sparse regression, solved layer by layer (see the layer-wise objective sketched after this list). It is based on a new approximate sparse regression solver, efficient enough to execute in a few hours on the largest openly available GPT models (175B parameters) using a single GPU. At the same time, SparseGPT is accurate enough that the post-pruning accuracy loss is negligible, without any fine-tuning.
- [UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers](https://arxiv.org/abs/2301.13741) by Tsinghua University et al. (ICML 2023) [[Code](https://github.com/sdc17/UPop)]
- [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695) by CMU, Meta AI Research et al. (May 2024) - The popular approach known as magnitude pruning removes the smallest weights in a network, based on the assumption that weights closest to 0 can be set to 0 with the least impact on performance. In LLMs, however, the magnitudes of a subset of outputs from an intermediate layer may be up to 20x larger than those of other outputs of the same layer. Removing the weights that are multiplied by these large outputs, even weights close to zero, could significantly degrade performance. Thus, a pruning technique that considers both weights and intermediate-layer outputs can accelerate a network with less impact on performance (see the pruning-score sketch after this list). Why it matters: The ability to compress models without affecting their performance is becoming more important as mobile devices and personal computers become powerful enough to run them. [Code: [Wanda](https://github.com/locuslab/wanda)]
- [What Matters in Transformers? Not All Attention is Needed](https://arxiv.org/abs/2406.15786) by UMD (Jun 2024) - While Transformer-based LLMs show strong performance across tasks, they often include redundant components that hinder efficiency. The authors investigate redundancy in blocks, MLP layers, and attention layers using a similarity-based metric (see the similarity sketch after this list). Surprisingly, many attention layers exhibit high redundancy and can be pruned with minimal performance impact; for instance, pruning half of the attention layers in Llama-2-70B led to a 48.4% speedup with just a 2.4% performance drop. They also propose jointly pruning attention and MLP layers, allowing more aggressive reduction with minimal loss. These findings offer key insights for optimizing transformer efficiency. [Code: [LLM-Drop](https://github.com/CASE-Lab-UMD/LLM-Drop)]
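
As a hedged restatement of the layer-wise compression problem that `SparseGPT` is described as solving (the notation here is illustrative, not taken from the paper): for each layer $\ell$ with original weights $W_\ell$ and calibration inputs $X_\ell$, find a sparsity mask $M_\ell$ and updated weights $\widehat{W}_\ell$ that minimize the reconstruction error

$$
\underset{M_\ell,\ \widehat{W}_\ell}{\arg\min}\ \bigl\lVert W_\ell X_\ell - (M_\ell \odot \widehat{W}_\ell)\,X_\ell \bigr\rVert_2^2 .
$$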
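
Below is a minimal PyTorch sketch of the weight-times-activation pruning score described in the Wanda entry above. It is not the official implementation; the function name, layer shapes, and 50% sparsity target are illustrative assumptions.

```python
import torch

def prune_weight_times_activation(weight: torch.Tensor,
                                  calib_inputs: torch.Tensor,
                                  sparsity: float = 0.5) -> torch.Tensor:
    """Zero out weights with the lowest |W_ij| * ||X_j||_2 score, per output row.

    weight:       (out_features, in_features) linear-layer weight matrix
    calib_inputs: (num_tokens, in_features) activations from a small calibration set
    """
    # Per-input-feature activation norm over the calibration tokens.
    act_norm = calib_inputs.norm(p=2, dim=0)                # (in_features,)
    score = weight.abs() * act_norm.unsqueeze(0)            # (out, in)

    # For each output row, drop the fraction of weights with the smallest score.
    num_prune = int(weight.shape[1] * sparsity)
    prune_idx = torch.argsort(score, dim=1)[:, :num_prune]  # lowest-scoring columns per row
    mask = torch.ones_like(weight, dtype=torch.bool)
    rows = torch.arange(weight.shape[0]).unsqueeze(1)       # (out, 1), broadcasts over prune_idx
    mask[rows, prune_idx] = False
    return weight * mask

# Illustrative usage with random data (shapes are assumptions).
W = torch.randn(4096, 4096)
X = torch.randn(128, 4096)
W_pruned = prune_weight_times_activation(W, X, sparsity=0.5)
```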
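And a minimal sketch of the similarity-based redundancy check described in the LLM-Drop entry (an illustration under assumed shapes, not the repository's code): attention sublayers whose outputs are nearly identical to their inputs on a calibration set change the hidden states the least, so the most similar ones are candidates for dropping.

```python
import torch

def attention_redundancy(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Mean cosine similarity between the hidden states entering and leaving an
    attention sublayer; higher similarity means the sublayer changes little,
    i.e. it is more redundant.

    hidden_in, hidden_out: (num_tokens, hidden_dim) activations from a calibration set.
    """
    sim = torch.nn.functional.cosine_similarity(hidden_in, hidden_out, dim=-1)
    return sim.mean().item()

# Illustrative usage: rank layers by redundancy and mark the top half for dropping
# (layer count, token count, and hidden size below are assumed toy values).
num_layers, tokens, dim = 8, 256, 512
acts = [(torch.randn(tokens, dim), torch.randn(tokens, dim)) for _ in range(num_layers)]
scores = [attention_redundancy(x, y) for x, y in acts]
drop_order = sorted(range(num_layers), key=lambda i: scores[i], reverse=True)
to_drop = drop_order[: num_layers // 2]   # most redundant attention layers
print("candidate attention layers to drop:", to_drop)
```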

### Distillation
