Add attention drop #6

Open · wants to merge 2 commits into `master`

1 change in README.md: 1 addition & 0 deletions

@@ -115,6 +115,7 @@ An awesome style list that curates the best machine learning model compression a
> They propose `SparseGPT`, the first accurate one-shot pruning method that works efficiently at the scale of models with 10-100 billion parameters. `SparseGPT` works by reducing the pruning problem to an extremely large-scale instance of sparse regression, solved layer by layer (see the layer-wise objective sketched after this list). It is based on a new approximate sparse regression solver, efficient enough to execute in a few hours on the largest openly available GPT models (175B parameters) using a single GPU. At the same time, SparseGPT is accurate enough that the post-pruning accuracy loss is negligible, without any fine-tuning.
- [UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers](https://arxiv.org/abs/2301.13741) by Tsinghua University et al. (ICML 2023) [[Code](https://github.com/sdc17/UPop)]
- [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695) by CMU, Meta AI Research et al. (May 2024) - The popular approach known as magnitude pruning removes the smallest weights in a network, based on the assumption that weights closest to 0 can be set to 0 with the least impact on performance. In LLMs, however, the magnitudes of a subset of outputs from an intermediate layer may be up to 20x larger than those of other outputs of the same layer. Removing the weights that are multiplied by these large outputs, even weights close to zero, could significantly degrade performance. Thus, a pruning technique that considers both weights and intermediate-layer outputs can accelerate a network with less impact on performance (see the pruning-score sketch after this list). Why it matters: The ability to compress models without affecting their performance is becoming more important as mobile devices and personal computers become powerful enough to run them. [Code: [Wanda](https://github.com/locuslab/wanda)]
- [What Matters in Transformers? Not All Attention is Needed](https://arxiv.org/abs/2406.15786) by UMD (Jun 2024) - While Transformer-based LLMs show strong performance across tasks, they often include redundant components that hinder efficiency. The authors investigate redundancy in blocks, MLP layers, and attention layers using a similarity-based metric (see the similarity sketch after this list). Surprisingly, many attention layers exhibit high redundancy and can be pruned with minimal performance impact; for instance, pruning half of the attention layers in Llama-2-70B led to a 48.4% speedup with just a 2.4% performance drop. They also propose jointly pruning attention and MLP layers, allowing more aggressive reduction with minimal loss. These findings offer key insights for optimizing transformer efficiency. [Code: [LLM-Drop](https://github.com/CASE-Lab-UMD/LLM-Drop)]
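
As a hedged restatement of the layer-wise compression problem that `SparseGPT` is described as solving (the notation here is illustrative, not taken from the paper): for each layer $\ell$ with original weights $W_\ell$ and calibration inputs $X_\ell$, find a sparsity mask $M_\ell$ and updated weights $\widehat{W}_\ell$ that minimize the reconstruction error

$$
\underset{M_\ell,\ \widehat{W}_\ell}{\arg\min}\ \bigl\lVert W_\ell X_\ell - (M_\ell \odot \widehat{W}_\ell)\,X_\ell \bigr\rVert_2^2 .
$$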
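
Below is a minimal PyTorch sketch of the weight-times-activation pruning score described in the Wanda entry above. It is not the official implementation; the function name, layer shapes, and 50% sparsity target are illustrative assumptions.

```python
import torch

def prune_weight_times_activation(weight: torch.Tensor,
                                  calib_inputs: torch.Tensor,
                                  sparsity: float = 0.5) -> torch.Tensor:
    """Zero out weights with the lowest |W_ij| * ||X_j||_2 score, per output row.

    weight:       (out_features, in_features) linear-layer weight matrix
    calib_inputs: (num_tokens, in_features) activations from a small calibration set
    """
    # Per-input-feature activation norm over the calibration tokens.
    act_norm = calib_inputs.norm(p=2, dim=0)                # (in_features,)
    score = weight.abs() * act_norm.unsqueeze(0)            # (out, in)

    # For each output row, drop the fraction of weights with the smallest score.
    num_prune = int(weight.shape[1] * sparsity)
    prune_idx = torch.argsort(score, dim=1)[:, :num_prune]  # lowest-scoring columns per row
    mask = torch.ones_like(weight, dtype=torch.bool)
    rows = torch.arange(weight.shape[0]).unsqueeze(1)       # (out, 1), broadcasts over prune_idx
    mask[rows, prune_idx] = False
    return weight * mask

# Illustrative usage with random data (shapes are assumptions).
W = torch.randn(4096, 4096)
X = torch.randn(128, 4096)
W_pruned = prune_weight_times_activation(W, X, sparsity=0.5)
```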
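And a minimal sketch of the similarity-based redundancy check described in the LLM-Drop entry (an illustration under assumed shapes, not the repository's code): attention sublayers whose outputs are nearly identical to their inputs on a calibration set change the hidden states the least, so the most similar ones are candidates for dropping.

```python
import torch

def attention_redundancy(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Mean cosine similarity between the hidden states entering and leaving an
    attention sublayer; higher similarity means the sublayer changes little,
    i.e. it is more redundant.

    hidden_in, hidden_out: (num_tokens, hidden_dim) activations from a calibration set.
    """
    sim = torch.nn.functional.cosine_similarity(hidden_in, hidden_out, dim=-1)
    return sim.mean().item()

# Illustrative usage: rank layers by redundancy and mark the top half for dropping
# (layer count, token count, and hidden size below are assumed toy values).
num_layers, tokens, dim = 8, 256, 512
acts = [(torch.randn(tokens, dim), torch.randn(tokens, dim)) for _ in range(num_layers)]
scores = [attention_redundancy(x, y) for x, y in acts]
drop_order = sorted(range(num_layers), key=lambda i: scores[i], reverse=True)
to_drop = drop_order[: num_layers // 2]   # most redundant attention layers
print("candidate attention layers to drop:", to_drop)
```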

### Distillation
