Commit
hyperlinking, meta description etc.
appoose committed Nov 3, 2023
1 parent 5fc07ea commit dcc6134
Showing 1 changed file with 41 additions and 14 deletions.
55 changes: 41 additions & 14 deletions index.html
@@ -11,6 +11,31 @@
<title>Low-Rank Pruning of Llama2</title>
<link rel="stylesheet" type="text/css" href="styling.css">
<link rel="icon" type="image/png" href="figs/aana_logo.png">

<meta name="description" content="An exploration of model pruning for machine learning, focusing on the reduction of model size and speed optimization for deployment on resource-constrained devices. Discusses structured and unstructured sparsity, low-rank pruning, and introduces a new rank reduction that is compatible with LoRA (Low-Rank Adaptation) approach for efficient training of large language models like LLama2-7B.">

<meta name="keywords" content="Model Pruning, Machine Learning, Low-Rank Pruning, Sparsity, LoRA, LLama2-7B, Model Compression, Singular Value Decomposition, Transformer Models, Neural Networks, AI Optimization">

<meta name="Hicham Badri and Appu Shaji" content="Mobius Labs GmbH">

<!-- Specific tags for Open Graph / social media sharing -->
<meta property="og:title" content="Low Rank Pruning of Llama2">
<meta property="og:description" content="An in-depth article discussing the intricacies of model pruning in machine learning, with a focus on low-rank techniques and their application in large language models for improved performance and efficiency.">
<meta property="og:image" content="https://mobiusml.github.io/low-rank-llama2/figs/pseudo-code.png">
<meta property="og:url" content="https://mobiusml.github.io/low-rank-llama2/">
<meta property="og:type" content="article">

<!-- Twitter Card data -->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="Low Rank Pruning of Llama2">
<meta name="twitter:description" content="Discover the advanced strategies for model pruning in AI, highlighting low-rank pruning and sparsity-aware optimizations for large language models such as LLama2-7B.">
<meta name="twitter:image" content="https://mobiusml.github.io/low-rank-llama2/figs/pseudo-code.png">
<meta name="twitter:creator" content="@appughar">

<!-- Meta tags for article publishing date and modification date -->
<meta name="article:published_time" content="2023-11-03T08:00:00+00:00">
<meta name="article:modified_time" content="2023-11-03T09:00:00+00:00">


</head>

@@ -25,14 +25,15 @@ <h1 class="page-title">Low-Rank Pruning of Llama2</h1>
class="highlight-gray">Hicham Badri</mark></a><mark class="highlight-gray">, </mark><a
href="https://scholar.google.com/citations?user=HxZDDzUAAAAJ&hl=en"><mark class="highlight-gray">Appu Shaji</mark></a><mark
class="highlight-gray"></mark></p>
<p><mark class="highlight-gray">Mobius Labs GmbH</mark></p>
<p><mark class="highlight-gray"><a href="https://www.mobiuslabs.com/"><mark
class="highlight-gray">Mobius Labs GmbH</mark></a></p>
<hr />
<p>In the ever-evolving landscape of artificial intelligence (AI), one undeniable trend has emerged in recent years: the relentless growth in the size and complexity of machine learning models. More specifically, large language models (LLMs), which mainly rely on transformers as building blocks, are reaching substantial parameter counts and require a significant amount of compute, a demand that is only expected to increase as larger and larger models are released.
</p>
<p>In this article, we explore low-rankness as a pruning technique of the LLama2-7B base model. We show that, by splitting almost all the linear layer weights into low-rank pairs <em>without fine-tuning</em> and leveraging LoRA for custom training, we can achieve the following without <em>implementing custom kernels</em>:
<p>In this article, we explore low-rankness as a pruning technique for the <a href="https://huggingface.co/meta-llama/Llama-2-7b">LLama2-7B base model</a>. We show that, by splitting almost all the linear layer weights into low-rank pairs <em>without fine-tuning</em> and leveraging LoRA for custom training, we can achieve the following without <em>implementing custom kernels</em>:
<ul>
<li>~50% reduction in the number of parameters.</li>
<li>Up to ~50% faster training vs. bitsandbytes’s 8-bit quantization.</li>
<li>Up to ~50% faster training vs. <a href="https://github.com/TimDettmers/bitsandbytes">bitsandbytes’s</a> 8-bit quantization.</li>
<li>Up to ~1.25x inference speed-up.</li>
</ul>
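
<p>To make the idea above concrete, here is a minimal, self-contained sketch of such a low-rank split. This is our own illustration based on a plain truncated SVD of a single linear layer; it is not the exact rank-reduction procedure used to produce the results in this article, and the rank of 1024 is only an example. The key point is that the pruned layer becomes a pair of thinner stock layers, so the standard dense <i>matmul</i> is reused and no custom kernel is needed.</p>

<pre><code># Minimal sketch (illustration only, not the exact procedure used in this article):
# approximate a linear layer's weight W (out x in) by A @ B with A (out x r), B (r x in)
# taken from a truncated SVD, then replace the layer with two stock nn.Linear modules.
import torch
import torch.nn as nn

def lowrank_split(linear: nn.Linear, rank: int) -> nn.Sequential:
    W = linear.weight.data.float()                  # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                      # (out_features, rank), columns scaled by singular values
    B = Vh[:rank, :]                                # (rank, in_features)

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = B.to(linear.weight.dtype)
    second.weight.data = A.to(linear.weight.dtype)
    if linear.bias is not None:
        second.bias.data = linear.bias.data.clone()
    return nn.Sequential(first, second)             # y = (x @ B.T) @ A.T + bias ~ x @ W.T + bias

# Example: a 4096x4096 projection split at rank 1024 keeps ~50% of the parameters.
layer = nn.Linear(4096, 4096, bias=False)
pruned = lowrank_split(layer, rank=1024)
</code></pre>

<p>Because the result is an ordinary pair of linear layers, LoRA adapters can be attached to them for custom training in the usual way.</p>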

@@ -72,6 +98,10 @@ <h1 class="page-title">Low-Rank Pruning of Llama2</h1>
href="#dataset">Dataset Performance</a></div>
<div class="table_of_contents-item table_of_contents-indent-0"><a class="table_of_contents-link"
href="conclusion">Conclusion</a></div>

<hr />
<div> Support code is available at <a href="https://github.com/mobiusml/low-rank-llama2/tree/main/code"><mark
class="highlight-gray">https://github.com/mobiusml/low-rank-llama2/tree/main/code</mark></a></div>
<!-- <div class="table_of_contents-item table_of_contents-indent-1"><a class="table_of_contents-link"
href="#291a3097-c118-4f5d-aad4-76df5b0640bf">Downstream Tasks</a></div>
<div class="table_of_contents-item table_of_contents-indent-1"><a class="table_of_contents-link"
@@ -91,14 +121,11 @@ <h2 id="intro" class="">Introduction</h2>
<p>In practice however, sparse pruning has many limitations. In order to achieve an actual speed-up, custom sparsity-aware matrix multiplication (<i>matmul</i>) operations are required. For the moment, this is only partially supported on <a href="https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/">Ampere GPUs</a> or on CPUs via <a href="https://neuralmagic.com/">NeuralMagic</a>. In PyTorch, sparse matrix multiplication operations are not optimized. For example, there is no implementation available of the batched <i>matmul</i> operation with sparse matrices; rewriting it with the existing operations requires some reshaping, and the result is 2-3x slower.
</p>
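
<p>To make the reshaping overhead concrete, the following is a minimal sketch (our own illustrative example, not code from this project) that emulates the batched product <i>y = x W<sup>T</sup></i> against a sparse weight using <code>torch.sparse.mm</code>, which only accepts 2-D operands. The extra flattening and transposing is exactly the kind of rewriting referred to above.</p>

<pre><code># Minimal sketch (illustrative example): emulate a batched matmul against a sparse weight
# with torch.sparse.mm, which only supports 2-D (sparse x dense) products.
import torch

B, T, d_in, d_out = 4, 128, 4096, 4096
x = torch.randn(B, T, d_in)

W = torch.randn(d_out, d_in)
W = W.masked_fill(torch.rand_like(W) > 0.1, 0.0)   # zero out ~90% of the entries
W_sparse = W.to_sparse()                           # sparse (COO) copy of the same weights

# Dense reference: torch.matmul broadcasts over the leading batch dimensions.
y_dense = x @ W.T                                  # (B, T, d_out)

# Sparse workaround: flatten the batch, put the sparse matrix first, transpose back,
# then restore the batch shape. These extra steps are where the slowdown comes from.
x2d = x.reshape(-1, d_in)                          # (B*T, d_in)
y2d = torch.sparse.mm(W_sparse, x2d.t()).t()       # (B*T, d_out)
y_sparse = y2d.reshape(B, T, d_out)

# torch.allclose(y_dense, y_sparse, atol=1e-4) holds up to floating-point error.
</code></pre>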

<p>Structured sparsity on the other hand consists in discarding weights in a structured way. For instance, we can remove columns, remove channels, block matrices, etc. This way, in theory, the model can be pruned without requiring specialized software/hardware for optimized runtime. Some structured sparsity methods still require optimized software to achieve faster runtime. For example, block-sparsity requires implementing dedicated GPU kernels for block-sparse <i>matmul</i> such as <a href="https://openai.com/research/block-sparse-gpu-kernels">https://openai.com/research/block-sparse-gpu-kernels</a>.
<p>Structured sparsity, on the other hand, consists of discarding weights in a structured way: for instance, we can remove columns, remove channels, prune entire blocks of a matrix, etc. This way, in theory, the model can be pruned without requiring specialized software/hardware for an optimized runtime. Some structured sparsity methods still require optimized software to achieve a faster runtime; for example, block-sparsity requires dedicated GPU kernels for block-sparse <i>matmul</i>, such as <a href="https://openai.com/research/block-sparse-gpu-kernels">OpenAI's block-sparse GPU kernels</a>.
</p>

<p>In practice, however, structured sparsity cannot be pushed as far as unstructured sparsity without a larger drop in accuracy. As a result, the performance gain is usually very limited.
</p>
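
<p>As a small illustration of why structured pruning composes with plain dense kernels, the sketch below (our own simplified example, not a method from this article) removes the output units of a linear layer with the smallest L2 weight norm and returns a genuinely smaller dense layer. In a real network, the layers that consume those outputs must be shrunk consistently, which is part of why aggressive structured pruning tends to hurt accuracy.</p>

<pre><code># Minimal sketch (simplified illustration): structured pruning that drops the output units
# of a linear layer with the smallest L2 weight norm. The result is a smaller *dense* layer,
# so the standard dense matmul applies and no sparsity-aware kernel is required.
import torch
import torch.nn as nn

def prune_output_units(linear: nn.Linear, keep_ratio: float = 0.5) -> nn.Linear:
    W = linear.weight.data                          # (out_features, in_features)
    n_keep = max(1, int(W.shape[0] * keep_ratio))
    scores = W.norm(p=2, dim=1)                     # one importance score per output unit
    keep = torch.topk(scores, n_keep).indices.sort().values

    pruned = nn.Linear(linear.in_features, n_keep, bias=linear.bias is not None)
    pruned.weight.data = W[keep].clone()
    if linear.bias is not None:
        pruned.bias.data = linear.bias.data[keep].clone()
    return pruned                                   # layers consuming these outputs must drop the same units

# Example with roughly the shape of an MLP projection in a 7B Llama-style model.
layer = nn.Linear(4096, 11008, bias=False)
smaller = prune_output_units(layer, keep_ratio=0.5)  # 11008 -> 5504 output units
</code></pre>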





<h2 id="lowrankpruning" class="">Low Rank Pruning</h2>
@@ -189,32 +216,32 @@ <h2 id="dataset">Dataset Performance</h2>
<td><b>LLama2-7B pruned</b></td>
</tr>
<tr>
<td>vicgalle/alpaca-gpt4</td>
<td><a href="https://huggingface.co/datasets/vicgalle/alpaca-gpt4">vicgalle/alpaca-gpt4</a></td>
<td>3.49</td>
<td>4.11</td>
</tr>
<tr>
<td>databricks/databricks-dolly-15k</td>
<td><a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k">databricks/databricks-dolly-15k</a></td>
<td>4.13</td>
<td>5.86</td>
</tr>
<tr>
<td>knkarthick/dialogsum</td>
<td><a href="https://huggingface.co/datasets/knkarthick/dialogsum">knkarthick/dialogsum</a></td>
<td>3.78</td>
<td>4.82</td>
</tr>
<tr>
<td>ArtifactAI/arxiv-math-instruct-50k</td>
<td><a href="https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k">ArtifactAI/arxiv-math-instruct-50k</a></td>
<td>3.08</td>
<td>3.73</td>
</tr>
<tr>
<td>Open-Orca/OpenOrca - 100k </td>
<td><a href="https://huggingface.co/datasets/Open-Orca/OpenOrca">Open-Orca/OpenOrca - 100k </a></td>
<td>3.51</td>
<td>4.27</td>
</tr>
<tr>
<td>Open-Orca/OpenOrca - 1M </td>
<td><a href="https://huggingface.co/datasets/Open-Orca/OpenOrca">Open-Orca/OpenOrca - 1M</a></td>
<td>-</td>
<td>3.43</td>
</tr>
@@ -229,7 +256,7 @@ <h2 id="dataset">Dataset Performance</h2>

<h2 id="conclusion">Conclusion</h2>

<p>In this article, we've demonstrated the utility of low-rank pruning as an effective method for accelerating large language models like LLama2-7B. Unlike sparse pruning, which often requires custom hardware or software configurations to realize significant speed gains, low-rank pruning doesn't require specialized kernel operations and can seamlessly integrate with existing matrix multiplication (<i>matmul</i>) implementations.
<p>In this article, we've demonstrated the utility of low-rank pruning as an effective method for accelerating large language models like LLama2-7B. Unlike sparse pruning, which often requires custom hardware or software configurations to realize significant speed gains, low-rank pruning doesn't require specialized kernel operations and can seamlessly integrate with existing matrix multiplication (<i><a href="https://pytorch.org/blog/inside-the-matrix/">matmul</a></i>) implementations.
</p>

<p>Nevertheless, there is ample scope for further refinements, and we aspire for this article to serve as an inspiration to the research community. We encourage researchers to embrace low-rank pruning and explore its synergistic potential when combined with other pruning and quantization techniques.
