Commit
hyperlinking, meta description etc.
appoose committed Nov 3, 2023
1 parent 5fc07ea commit dcc6134
Showing 1 changed file with 41 additions and 14 deletions.
55 changes: 41 additions & 14 deletions index.html
@@ -11,6 +11,31 @@
<title>Low-Rank Pruning of Llama2</title>
<link rel="stylesheet" type="text/css" href="styling.css">
<link rel="icon" type="image/png" href="figs/aana_logo.png">

<meta name="description" content="An exploration of model pruning for machine learning, focusing on the reduction of model size and speed optimization for deployment on resource-constrained devices. Discusses structured and unstructured sparsity, low-rank pruning, and introduces a new rank reduction that is compatible with LoRA (Low-Rank Adaptation) approach for efficient training of large language models like LLama2-7B.">

<meta name="keywords" content="Model Pruning, Machine Learning, Low-Rank Pruning, Sparsity, LoRA, LLama2-7B, Model Compression, Singular Value Decomposition, Transformer Models, Neural Networks, AI Optimization">

<meta name="Hicham Badri and Appu Shaji" content="Mobius Labs GmbH">

<!-- Specific tags for Open Graph / social media sharing -->
<meta property="og:title" content="Low Rank Pruning of Llama2">
<meta property="og:description" content="An in-depth article discussing the intricacies of model pruning in machine learning, with a focus on low-rank techniques and their application in large language models for improved performance and efficiency.">
<meta property="og:image" content="https://mobiusml.github.io/low-rank-llama2/figs/pseudo-code.png">
<meta property="og:url" content="https://mobiusml.github.io/low-rank-llama2/">
<meta property="og:type" content="article">

<!-- Twitter Card data -->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="Low Rank Pruning of Llama2">
<meta name="twitter:description" content="Discover the advanced strategies for model pruning in AI, highlighting low-rank pruning and sparsity-aware optimizations for large language models such as LLama2-7B.">
<meta name="twitter:image" content="https://mobiusml.github.io/low-rank-llama2/figs/pseudo-code.png">
<meta name="twitter:creator" content="@appughar">

<!-- Meta tags for article publishing date and modification date -->
<meta name="article:published_time" content="2023-11-03T08:00:00+00:00">
<meta name="article:modified_time" content="2023-11-03T09:00:00+00:00">


</head>

@@ -25,14 +25,15 @@ <h1 class="page-title">Low-Rank Pruning of Llama2</h1>
class="highlight-gray">Hicham Badri</mark></a><mark class="highlight-gray">, </mark><a
href="https://scholar.google.com/citations?user=HxZDDzUAAAAJ&hl=en"><mark class="highlight-gray">Appu Shaji</mark></a><mark
class="highlight-gray"></mark></p>
<p><mark class="highlight-gray">Mobius Labs GmbH</mark></p>
<p><mark class="highlight-gray"><a href="https://www.mobiuslabs.com/"><mark
class="highlight-gray">Mobius Labs GmbH</mark></a></p>
<hr />
<p>In the ever-evolving landscape of artificial intelligence (AI), one undeniable trend has emerged in recent years: the relentless growth in the size and complexity of machine learning models. More specifically, large language models (LLMs), which mainly rely on transformers as building blocks, are reaching substantial parameter counts and require a significant amount of compute, a demand that is only expected to increase as larger and larger models are released.
</p>
<p>In this article, we explore low-rankness as a pruning technique of the LLama2-7B base model. We show that, by splitting almost all the linear layer weights into low-rank pairs <em>without fine-tuning</em> and leveraging LoRA for custom training, we can achieve the following without <em>implementing custom kernels</em>:
<p>In this article, we explore low-rankness as a pruning technique for the <a href="https://huggingface.co/meta-llama/Llama-2-7b">LLama2-7B base model</a>. We show that, by splitting almost all the linear layer weights into low-rank pairs <em>without fine-tuning</em> and leveraging LoRA for custom training, we can achieve the following without <em>implementing custom kernels</em>:
<ul>
<li>~50% reduction in the number of parameters.</li>
<li>Up to ~50% faster training vs. bitsandbytes’s 8-bit quantization.</li>
<li>Up to ~50% faster training vs. <a href="https://github.com/TimDettmers/bitsandbytes">bitsandbytes’s</a> 8-bit quantization.</li>
<li>Up to ~1.25x inference speed-up.</li>
</ul>
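
<p>To make the idea above concrete, here is a minimal, self-contained sketch of such a low-rank split. This is our own illustration based on a plain truncated SVD of a single linear layer; it is not the exact rank-reduction procedure used to produce the results in this article, and the rank of 1024 is only an example. The key point is that the pruned layer becomes a pair of thinner stock layers, so the standard dense <i>matmul</i> is reused and no custom kernel is needed.</p>

<pre><code># Minimal sketch (illustration only, not the exact procedure used in this article):
# approximate a linear layer's weight W (out x in) by A @ B with A (out x r), B (r x in)
# taken from a truncated SVD, then replace the layer with two stock nn.Linear modules.
import torch
import torch.nn as nn

def lowrank_split(linear: nn.Linear, rank: int) -> nn.Sequential:
    W = linear.weight.data.float()                  # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                      # (out_features, rank), columns scaled by singular values
    B = Vh[:rank, :]                                # (rank, in_features)

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = B.to(linear.weight.dtype)
    second.weight.data = A.to(linear.weight.dtype)
    if linear.bias is not None:
        second.bias.data = linear.bias.data.clone()
    return nn.Sequential(first, second)             # y = (x @ B.T) @ A.T + bias ~ x @ W.T + bias

# Example: a 4096x4096 projection split at rank 1024 keeps ~50% of the parameters.
layer = nn.Linear(4096, 4096, bias=False)
pruned = lowrank_split(layer, rank=1024)
</code></pre>

<p>Because the result is an ordinary pair of linear layers, LoRA adapters can be attached to them for custom training in the usual way.</p>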

@@ -72,6 +98,10 @@ <h1 class="page-title">Low-Rank Pruning of Llama2</h1>
href="#dataset">Dataset Performance</a></div>
<div class="table_of_contents-item table_of_contents-indent-0"><a class="table_of_contents-link"
href="conclusion">Conclusion</a></div>

<hr />
<div> Support code is available at <a href="https://github.com/mobiusml/low-rank-llama2/tree/main/code"><mark
class="highlight-gray">https://github.com/mobiusml/low-rank-llama2/tree/main/code</mark></a></div>
<!-- <div class="table_of_contents-item table_of_contents-indent-1"><a class="table_of_contents-link"
href="#291a3097-c118-4f5d-aad4-76df5b0640bf">Downstream Tasks</a></div>
<div class="table_of_contents-item table_of_contents-indent-1"><a class="table_of_contents-link"
@@ -91,14 +121,11 @@ <h2 id="intro" class="">Introduction</h2>
<p>In practice however, sparse pruning has many limitations. In order to achieve an actual speed-up, custom sparsity-aware matrix multiplication (<i>matmul</i>) operations are required. For the moment, this is only partially supported on <a href="https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/">Ampere GPUs</a> or on CPUs via <a href="https://neuralmagic.com/">NeuralMagic</a>. In PyTorch, sparse matrix multiplication operations are not optimized. For example, there is no implementation available of the batched <i>matmul</i> operation with sparse matrices; rewriting it with the existing operations requires some reshaping, and the result is 2-3x slower.
</p>
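
<p>To make the reshaping overhead concrete, the following is a minimal sketch (our own illustrative example, not code from this project) that emulates the batched product <i>y = x W<sup>T</sup></i> against a sparse weight using <code>torch.sparse.mm</code>, which only accepts 2-D operands. The extra flattening and transposing is exactly the kind of rewriting referred to above.</p>

<pre><code># Minimal sketch (illustrative example): emulate a batched matmul against a sparse weight
# with torch.sparse.mm, which only supports 2-D (sparse x dense) products.
import torch

B, T, d_in, d_out = 4, 128, 4096, 4096
x = torch.randn(B, T, d_in)

W = torch.randn(d_out, d_in)
W = W.masked_fill(torch.rand_like(W) > 0.1, 0.0)   # zero out ~90% of the entries
W_sparse = W.to_sparse()                           # sparse (COO) copy of the same weights

# Dense reference: torch.matmul broadcasts over the leading batch dimensions.
y_dense = x @ W.T                                  # (B, T, d_out)

# Sparse workaround: flatten the batch, put the sparse matrix first, transpose back,
# then restore the batch shape. These extra steps are where the slowdown comes from.
x2d = x.reshape(-1, d_in)                          # (B*T, d_in)
y2d = torch.sparse.mm(W_sparse, x2d.t()).t()       # (B*T, d_out)
y_sparse = y2d.reshape(B, T, d_out)

# torch.allclose(y_dense, y_sparse, atol=1e-4) holds up to floating-point error.
</code></pre>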

<p>Structured sparsity on the other hand consists in discarding weights in a structured way. For instance, we can remove columns, remove channels, block matrices, etc. This way, in theory, the model can be pruned without requiring specialized software/hardware for optimized runtime. Some structured sparsity methods still require optimized software to achieve faster runtime. For example, block-sparsity requires implementing dedicated GPU kernels for block-sparse <i>matmul</i> such as <a href="https://openai.com/research/block-sparse-gpu-kernels">https://openai.com/research/block-sparse-gpu-kernels</a>.
<p>Structured sparsity, on the other hand, consists of discarding weights in a structured way: for instance, we can remove columns, remove channels, prune entire blocks of a matrix, etc. This way, in theory, the model can be pruned without requiring specialized software/hardware for an optimized runtime. Some structured sparsity methods still require optimized software to achieve a faster runtime; for example, block-sparsity requires dedicated GPU kernels for block-sparse <i>matmul</i>, such as <a href="https://openai.com/research/block-sparse-gpu-kernels">OpenAI's block-sparse GPU kernels</a>.
</p>

<p>In practice, however, structured sparsity cannot be pushed as far as unstructured sparsity without a larger drop in accuracy. As a result, the performance gain is usually very limited.
</p>
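
<p>As a small illustration of why structured pruning composes with plain dense kernels, the sketch below (our own simplified example, not a method from this article) removes the output units of a linear layer with the smallest L2 weight norm and returns a genuinely smaller dense layer. In a real network, the layers that consume those outputs must be shrunk consistently, which is part of why aggressive structured pruning tends to hurt accuracy.</p>

<pre><code># Minimal sketch (simplified illustration): structured pruning that drops the output units
# of a linear layer with the smallest L2 weight norm. The result is a smaller *dense* layer,
# so the standard dense matmul applies and no sparsity-aware kernel is required.
import torch
import torch.nn as nn

def prune_output_units(linear: nn.Linear, keep_ratio: float = 0.5) -> nn.Linear:
    W = linear.weight.data                          # (out_features, in_features)
    n_keep = max(1, int(W.shape[0] * keep_ratio))
    scores = W.norm(p=2, dim=1)                     # one importance score per output unit
    keep = torch.topk(scores, n_keep).indices.sort().values

    pruned = nn.Linear(linear.in_features, n_keep, bias=linear.bias is not None)
    pruned.weight.data = W[keep].clone()
    if linear.bias is not None:
        pruned.bias.data = linear.bias.data[keep].clone()
    return pruned                                   # layers consuming these outputs must drop the same units

# Example with roughly the shape of an MLP projection in a 7B Llama-style model.
layer = nn.Linear(4096, 11008, bias=False)
smaller = prune_output_units(layer, keep_ratio=0.5)  # 11008 -> 5504 output units
</code></pre>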





<h2 id="lowrankpruning" class="">Low Rank Pruning</h2>
@@ -189,32 +216,32 @@ <h2 id="dataset">Dataset Performance</h2>
<td><b>LLama2-7B pruned</b></td>
</tr>
<tr>
<td>vicgalle/alpaca-gpt4</td>
<td><a href="https://huggingface.co/datasets/vicgalle/alpaca-gpt4">vicgalle/alpaca-gpt4</a></td>
<td>3.49</td>
<td>4.11</td>
</tr>
<tr>
<td>databricks/databricks-dolly-15k</td>
<td><a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k">databricks/databricks-dolly-15k</a></td>
<td>4.13</td>
<td>5.86</td>
</tr>
<tr>
<td>knkarthick/dialogsum</td>
<td><a href="https://huggingface.co/datasets/knkarthick/dialogsum">knkarthick/dialogsum</a></td>
<td>3.78</td>
<td>4.82</td>
</tr>
<tr>
<td>ArtifactAI/arxiv-math-instruct-50k</td>
<td><a href="https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k">ArtifactAI/arxiv-math-instruct-50k</a></td>
<td>3.08</td>
<td>3.73</td>
</tr>
<tr>
<td>Open-Orca/OpenOrca - 100k </td>
<td><a href="https://huggingface.co/datasets/Open-Orca/OpenOrca">Open-Orca/OpenOrca - 100k </a></td>
<td>3.51</td>
<td>4.27</td>
</tr>
<tr>
<td>Open-Orca/OpenOrca - 1M </td>
<td><a href="https://huggingface.co/datasets/Open-Orca/OpenOrca">Open-Orca/OpenOrca - 1M</a></td>
<td>-</td>
<td>3.43</td>
</tr>
@@ -229,7 +256,7 @@ <h2 id="dataset">Dataset Performance</h2>

<h2 id="conclusion">Conclusion</h2>

<p>In this article, we've demonstrated the utility of low-rank pruning as an effective method for accelerating large language models like LLama2-7B. Unlike sparse pruning, which often requires custom hardware or software configurations to realize significant speed gains, low-rank pruning doesn't require specialized kernel operations and can seamlessly integrate with existing matrix multiplication (<i>matmul</i>) implementations.
<p>In this article, we've demonstrated the utility of low-rank pruning as an effective method for accelerating large language models like LLama2-7B. Unlike sparse pruning, which often requires custom hardware or software configurations to realize significant speed gains, low-rank pruning doesn't require specialized kernel operations and can seamlessly integrate with existing matrix multiplication (<i><a href="https://pytorch.org/blog/inside-the-matrix/">matmul</a></i>) implementations.
</p>

<p>Nevertheless, there is ample scope for further refinements, and we aspire for this article to serve as an inspiration to the research community. We encourage researchers to embrace low-rank pruning and explore its synergistic potential when combined with other pruning and quantization techniques.
