
Introduction

Model pruning refers to the process of removing redundant information from machine learning models to make them “leaner”. The pruned model is smaller and should run faster, which makes it suitable for deployment on resource-constrained devices or in real-time applications. Pruning can be combined with other techniques, such as quantization, to further optimize runtime. The most popular pruning approaches are based on discarding individual neurons, channels, or entire layers; this kind of pruning is referred to as “sparsification”.
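
As a concrete illustration, here is a minimal sketch of magnitude-based sparsification using PyTorch's built-in torch.nn.utils.prune utilities. The choice of library and the layer shapes are illustrative assumptions, not something this post prescribes:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Zero out the 50% of weights with the smallest L1 magnitude.
layer = nn.Linear(256, 256)
prune.l1_unstructured(layer, name="weight", amount=0.5)

# The weights are masked, not physically removed: the tensor keeps
# its original shape, so this alone does not yield any speed-up.
print(layer.weight.shape)                   # torch.Size([256, 256])
print((layer.weight == 0).float().mean())   # ~0.5

# Fold the mask into the weight to make the pruning permanent.
prune.remove(layer, "weight")
```

Note that the pruned weights are only zeroed out: the weight tensor keeps its original shape, which is exactly why sparsification alone does not make the model faster.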

In practice, however, sparse pruning has many limitations. To achieve an actual speed-up, custom sparsity-aware matrix multiplication (matmul) operations are required. For the moment, these are only partially supported, on Ampere GPUs (https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/) or on CPUs via NeuralMagic (https://neuralmagic.com/). In PyTorch, sparse matrix multiplication operations are not optimized: for example, there is no implementation of the batched matmul operation for sparse matrices. Rewriting it with the existing operations requires some reshaping, and the result is 2-3x slower.
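
To make this concrete, here is a small sketch of the kind of workaround currently needed; the shapes, sparsity level, and loop-based emulation are illustrative assumptions, not code from any specific library:

```python
import torch

B, M, K, N = 8, 512, 512, 64
weights = torch.randn(B, M, K)
mask = torch.rand(B, M, K) > 0.9        # keep ~10% of the weights
pruned = weights * mask                  # sparsified weight tensor
x = torch.randn(B, K, N)

# No fused batched sparse matmul exists, so the batch dimension has
# to be unrolled into separate 2-D torch.sparse.mm calls.
out = torch.stack([
    torch.sparse.mm(pruned[i].to_sparse(), x[i]) for i in range(B)
])

# Dense reference: a single fused torch.bmm call.
ref = torch.bmm(pruned, x)
assert torch.allclose(out, ref, atol=1e-4)
```

The per-slice conversion and the Python-level loop (or, alternatively, restructuring the batch into one large block-diagonal matrix) are where the overhead comes from, which is why this emulation ends up slower than a single dense torch.bmm.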

Structured sparsity, on the other hand, consists of discarding weights in a structured way: removing entire columns, removing channels, blocking matrices, etc. This way, in theory, the model can be pruned without requiring specialized software or hardware for an optimized runtime. That said, some structured sparsity methods still require optimized software to run faster; for example, block sparsity requires dedicated GPU kernels for block-sparse matmul, such as https://openai.com/research/block-sparse-gpu-kernels.
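
For contrast, below is a minimal sketch of structured pruning using torch.nn.utils.prune.ln_structured (again an illustrative choice, with assumed shapes): it zeroes entire rows of a linear layer's weight matrix, i.e. whole output neurons, which can then be physically removed without any sparsity-aware kernels:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)
# Zero the 50% of output neurons (rows of the weight matrix)
# with the smallest L2 norm.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

row_is_zero = (layer.weight == 0).all(dim=1)
print(row_is_zero.sum().item())          # 128 of 256 rows pruned

# The zeroed rows can be dropped into a smaller dense layer that
# runs with ordinary dense matmuls, no special kernels required.
kept = ~row_is_zero
small = nn.Linear(256, int(kept.sum().item()))
with torch.no_grad():
    small.weight.copy_(layer.weight[kept])
    small.bias.copy_(layer.bias[kept])
```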

In practice, however, structured sparsity cannot be pushed as far as unstructured sparsity without a larger drop in accuracy. As a result, the performance gain is usually very limited.