From 1b01670696d36efa42812a535bebf1f473315bd7 Mon Sep 17 00:00:00 2001
From: shwaihe
Date: Wed, 16 Oct 2024 11:10:32 -0400
Subject: [PATCH 1/2] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 123e75e..c99e332 100644
--- a/README.md
+++ b/README.md
@@ -115,6 +115,7 @@ An awesome style list that curates the best machine learning model compression a
 > They propose `SparseGPT`, the first accurate one-shot pruning method which works efficiently at the scale of models with 10-100 billion parameters. `SparseGPT` works by reducing the pruning problem to an extremely large-scale instance of sparse regression. It is based on a new approximate sparse regression solver, used to solve a layer-wise compression problem, which is efficient enough to execute in a few hours on the largest openly-available GPT models (175B parameters), using a single GPU. At the same time, SparseGPT is accurate enough to drop negligible accuracy post-pruning, without any fine-tuning.
 - [UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers](https://arxiv.org/abs/2301.13741) by Tsinghua University et al. (ICML 2023) [[Code](https://github.com/sdc17/UPop)]
 - [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695) by CMU, Meta AI Research et al. (May 2024) - The popular approach known as magnitude pruning removes the smallest weights in a network based on the assumption that weights closest to 0 can be set to 0 with the least impact on performance. In LLMs, the magnitudes of a subset of outputs from an intermediate layer may be up to 20x larger than those of other outputs of the same layer. Removing the weights that are multiplied by these large outputs — even weights close to zero — could significantly degrade performance. Thus, a pruning technique that considers both weights and intermediate-layer outputs can accelerate a network with less impact on performance. Why it matters: The ability to compress models without affecting their performance is becoming more important as mobiles and personal computers become powerful enough to run them. [Code: [Wanda](https://github.com/locuslab/wanda)]
+- [What Matters in Transformers? Not All Attention is Needed](https://arxiv.org/abs/2406.15786) by UMD (Jun 2024) - While Transformer-based LLMs show strong performance across tasks, they often include redundant components that hinder efficiency. The authors investigate redundancy in Blocks, MLP layers, and Attention layers using a similarity-based metric. Surprisingly, many attention layers exhibit high redundancy and can be pruned with minimal performance impact: pruning half of the attention layers in Llama-2-70B yields a 48.4% speedup with only a 2.4% performance drop. They also propose jointly pruning Attention and MLP layers, allowing more aggressive reduction with minimal loss. These findings offer key insights for optimizing transformer efficiency. [Code: LLM-Drop](https://github.com/CASE-Lab-UMD/LLM-Drop)
 
 ### Distillation
 

From 75f290567e12bda316a456560e5cde45d5b3c2cc Mon Sep 17 00:00:00 2001
From: shwaihe
Date: Wed, 16 Oct 2024 11:12:33 -0400
Subject: [PATCH 2/2] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index c99e332..e13242d 100644
--- a/README.md
+++ b/README.md
@@ -115,7 +115,7 @@ An awesome style list that curates the best machine learning model compression a
 > They propose `SparseGPT`, the first accurate one-shot pruning method which works efficiently at the scale of models with 10-100 billion parameters. `SparseGPT` works by reducing the pruning problem to an extremely large-scale instance of sparse regression. It is based on a new approximate sparse regression solver, used to solve a layer-wise compression problem, which is efficient enough to execute in a few hours on the largest openly-available GPT models (175B parameters), using a single GPU. At the same time, SparseGPT is accurate enough to drop negligible accuracy post-pruning, without any fine-tuning.
 - [UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers](https://arxiv.org/abs/2301.13741) by Tsinghua University et al. (ICML 2023) [[Code](https://github.com/sdc17/UPop)]
 - [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695) by CMU, Meta AI Research et al. (May 2024) - The popular approach known as magnitude pruning removes the smallest weights in a network based on the assumption that weights closest to 0 can be set to 0 with the least impact on performance. In LLMs, the magnitudes of a subset of outputs from an intermediate layer may be up to 20x larger than those of other outputs of the same layer. Removing the weights that are multiplied by these large outputs — even weights close to zero — could significantly degrade performance. Thus, a pruning technique that considers both weights and intermediate-layer outputs can accelerate a network with less impact on performance. Why it matters: The ability to compress models without affecting their performance is becoming more important as mobiles and personal computers become powerful enough to run them. [Code: [Wanda](https://github.com/locuslab/wanda)]
-- [What Matters in Transformers? Not All Attention is Needed](https://arxiv.org/abs/2406.15786) by UMD (Jun 2024) - While Transformer-based LLMs show strong performance across tasks, they often include redundant components that hinder efficiency. The authors investigate redundancy in Blocks, MLP layers, and Attention layers using a similarity-based metric. Surprisingly, many attention layers exhibit high redundancy and can be pruned with minimal performance impact: pruning half of the attention layers in Llama-2-70B yields a 48.4% speedup with only a 2.4% performance drop. They also propose jointly pruning Attention and MLP layers, allowing more aggressive reduction with minimal loss. These findings offer key insights for optimizing transformer efficiency. [Code: LLM-Drop](https://github.com/CASE-Lab-UMD/LLM-Drop)
+- [What Matters in Transformers? Not All Attention is Needed](https://arxiv.org/abs/2406.15786) by UMD (Jun 2024) - While Transformer-based LLMs show strong performance across tasks, they often include redundant components that hinder efficiency. The authors investigate redundancy in Blocks, MLP layers, and Attention layers using a similarity-based metric. Surprisingly, many attention layers exhibit high redundancy and can be pruned with minimal performance impact: pruning half of the attention layers in Llama-2-70B yields a 48.4% speedup with only a 2.4% performance drop. They also propose jointly pruning Attention and MLP layers, allowing more aggressive reduction with minimal loss. These findings offer key insights for optimizing transformer efficiency. [Code: [LLM-Drop](https://github.com/CASE-Lab-UMD/LLM-Drop)]
 
 ### Distillation
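To make the method in the new entry concrete, here is a minimal sketch of the kind of similarity-based redundancy metric it describes: score each attention sublayer by how close its output hidden states are to its inputs on a small calibration set, then drop the most redundant layers. The names (`redundancy_score`, `select_layers_to_drop`, `hidden_in`, `hidden_out`, `drop_ratio`) are assumptions for illustration, not the LLM-Drop implementation.

```python
# Hedged sketch: similarity-based redundancy scoring for attention sublayers.
# A score near 1.0 means the sublayer barely transforms its input, making it a
# candidate for dropping. Illustrative names only, not the LLM-Drop code.
import torch
import torch.nn.functional as F


def redundancy_score(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Mean cosine similarity between a sublayer's input and output hidden states.

    hidden_in, hidden_out: (batch, seq_len, hidden_dim) activations captured on a
    small calibration set.
    """
    sim = F.cosine_similarity(
        hidden_in.flatten(0, 1), hidden_out.flatten(0, 1), dim=-1
    )
    return sim.mean().item()


def select_layers_to_drop(scores: dict[int, float], drop_ratio: float = 0.5) -> list[int]:
    """Return the indices of the most redundant layers (highest similarity)."""
    n_drop = int(len(scores) * drop_ratio)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:n_drop])


# Example usage (assumes per-layer activations were collected with forward hooks):
# scores = {i: redundancy_score(attn_in[i], attn_out[i]) for i in range(num_layers)}
# layers_to_drop = select_layers_to_drop(scores, drop_ratio=0.5)
```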
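Similarly, the contrast the Wanda entry draws between plain magnitude pruning and a weight-times-activation score can be sketched as below. `act_norm` is assumed to be the per-input-feature L2 norm collected on a calibration set, and the per-output-row comparison and all names are illustrative rather than the released Wanda code.

```python
# Hedged sketch: a pruning score that weighs each weight by the norm of the input
# feature it multiplies, so small weights attached to large activations survive.
import torch


def wanda_style_mask(
    weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float = 0.5
) -> torch.Tensor:
    """Binary keep-mask for a linear layer.

    weight:   (out_features, in_features) weight matrix.
    act_norm: (in_features,) L2 norm of each input feature on a calibration set.
    Plain magnitude pruning would rank by |W| alone; here the score is |W| * ||X||.
    """
    score = weight.abs() * act_norm  # broadcasts act_norm across output rows
    k = int(weight.shape[1] * sparsity)  # weights to remove per output row
    if k == 0:
        return torch.ones_like(score, dtype=torch.bool)
    threshold = score.kthvalue(k, dim=1, keepdim=True).values
    return score > threshold


# Example usage: zero out pruned weights of an nn.Linear layer.
# pruned = layer.weight.data * wanda_style_mask(layer.weight.data, act_norm)
```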