
Example pytorch_quantization doesn't shown speed up #3808

Closed
ecilay opened this issue Apr 21, 2024 · 14 comments


ecilay commented Apr 21, 2024

Description

I have been following the documentation below to quantize a pretrained ResNet and get a feel for how it works. However, the quantized ResNet model is the same size as the original PyTorch model, and runtime and memory use are also unchanged. Is this expected?
https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/index.html

Environment

Built pytorch_quantization from source as instructed in the README.

TensorRT Version:

NVIDIA GPU:

NVIDIA Driver Version: 525.105

CUDA Version: 12.1

CUDNN Version:

Operating System:

Python Version (if applicable): 3.10

Tensorflow Version (if applicable):

PyTorch Version (if applicable): 2.1.0+cu121

Baremetal or Container (if so, version):

@ecilay changed the title from "Documented example doesn't shown speed up" to "Example pytorch_quantization doesn't shown speed up" on Apr 21, 2024

ecilay commented Apr 21, 2024

What's the difference between these two versions of ResNet?
Why doesn't using the model directly from PyTorch show a speedup?


ecilay commented Apr 21, 2024

I printed the model precision and found the model is actually still in fp32. This is the script I copied from the documentation; did I miss anything?

import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import calib
from pytorch_quantization.tensor_quant import QuantDescriptor

from torchvision.models import resnet50, ResNet50_Weights


total_batch = 512
num_tests = 5

device = torch.device("cuda:0")

quant_modules.initialize()
quant_desc_input = QuantDescriptor()
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)  # pretrained=True is deprecated in recent torchvision
model.cuda()
model.eval()
for param in model.parameters():
    print(param.dtype)

preprocess = weights.transforms()

def collect_stats(model, image_dir="quantize_images"):
    """Feed data to the network and collect statistics."""

    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # uint8 input so the torchvision preprocessing transforms accept the tensor
    inputs = torch.randint(0, 255, (num_tests, total_batch, 3, 224, 224), dtype=torch.uint8)
    inputs = [preprocess(_input) for _input in inputs]
    for _input in inputs:
        _ = model(_input.cuda()).squeeze(0).softmax(0)

    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

def compute_amax(model, **kwargs):
    # Load calib result
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
            print(F"{name:40}: {module}")
    model.cuda()


with torch.no_grad():
    collect_stats(model)
    compute_amax(model, method="percentile", percentile=99.99)

torch.save(model.state_dict(), "quant.pt")
for param in model.parameters():
    print(param.dtype)


lix19937 commented Apr 22, 2024

torch.save(model.state_dict(), "quant.pt")
for param in model.parameters():
    print(param.dtype)

quant.pt has more scale layers than the non-quantized checkpoint. Alternatively, export an .onnx file and look at it in Netron; it will be much clearer.

param.dtype will always be fp32.
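
For reference, a minimal export sketch that makes the Q/DQ (scale) layers visible in Netron, assuming the calibrated `model` from the script above; the filename is hypothetical:

```python
import torch
from pytorch_quantization import nn as quant_nn

# Emit the TensorQuantizer modules as ONNX QuantizeLinear/DequantizeLinear nodes
quant_nn.TensorQuantizer.use_fb_fake_quant = True

dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(
    model,
    dummy_input,
    "quant_resnet50.onnx",   # hypothetical output path
    opset_version=13,        # opset >= 13 supports per-channel Q/DQ
    input_names=["input"],
    output_names=["output"],
)
```

Opening quant_resnet50.onnx in Netron should show QuantizeLinear/DequantizeLinear pairs in front of the convolutions, which a plain FP32 export does not have.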


ecilay commented Apr 22, 2024

Sorry, I don't think that quite answers my question...

Am I saving the model incorrectly? If so, how should I save the quantized model so that next time I can load the same modeling code directly with the new quantized checkpoint?


lix19937 commented Apr 22, 2024

What's the difference between these two versions of ResNet?

The quant version (with Q/DQ nodes inserted) vs. the non-quantized version.

Why doesn't using the model directly from PyTorch show a speedup?

The quant version adds more scale layers (mul ops). NVIDIA's pytorch_quantization tool only speeds the model up when it is run with TensorRT, not with PyTorch.

I think you should read more of the quantization docs.
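
In other words, the speedup only appears after the exported ONNX is compiled into a TensorRT engine. A minimal sketch, assuming the hypothetical quant_resnet50.onnx from the export step above (the same thing can be done with `trtexec --onnx=quant_resnet50.onnx --int8`):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is the standard flag for ONNX networks in TensorRT 8.x/9.x
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("quant_resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX file")

config = builder.create_builder_config()
# The Q/DQ nodes already carry the scales, so no calibrator is attached here.
config.set_flag(trt.BuilderFlag.INT8)

engine_bytes = builder.build_serialized_network(network, config)
with open("quant_resnet50.plan", "wb") as f:
    f.write(engine_bytes)
```

Benchmarking that engine (for example with `trtexec --loadEngine=quant_resnet50.plan`) against an FP32 engine of the same network is where the INT8 speedup should show up.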


ecilay commented Apr 22, 2024

Okay, good to know, but to be honest this is not explicitly stated in the documentation. If I missed it, I would appreciate you kindly pointing it out. Thanks.

only speeds the model up when it is run with TensorRT, not with PyTorch.


ecilay commented Apr 22, 2024

Besides, I am trying to use the tool to quantize a latent encoder; this is the model code: https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/diffusionmodules/model.py#L368. The inference code wraps this model in a pl.LightningModule.
When I try to quantize it, I get:

File "/home/lib/python3.10/site-packages/pytorch_quantization-2.2.0-py3.10-linux-x86_64.egg/pytorch_quantization/nn/modules/tensor_quantizer.py", line 260, in load_calib_amax
    raise RuntimeError(err_msg + " Passing 'strict=False' to `load_calib_amax()` will ignore the error.")
RuntimeError: Calibrator returned None. This usually happens when calibrator hasn't seen any tensor. Passing 'strict=False' to `load_calib_amax()` will ignore the error.
[2024-04-22 00:16:14,157] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-04-22 00:16:14,157] torch._dynamo.utils: [INFO] Function    Runtimes (s)
[2024-04-22 00:16:14,157] torch._dynamo.utils: [INFO] ----------  --------------

Would you have some clue why this happens?

If I use the plain nn.Module, quantization works, but the exported ONNX produces quite wrong results.
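
As the error message itself says, this usually means some TensorQuantizer modules never saw any tensors during the calibration forward passes. A rough, hypothetical diagnostic sketch, assuming `model` is the wrapped latent encoder and using the `strict=False` workaround the message suggests:

```python
from pytorch_quantization import nn as quant_nn

# Load whatever amax values were collected, skipping quantizers whose
# calibrator saw no data (the workaround suggested by the error message).
for name, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.load_calib_amax(strict=False)

# Any quantizer still reporting amax=None was never reached by the
# calibration forward passes, e.g. its layer did not run on the calibration data.
for name, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None and module.amax is None:
            print(f"no calibration data seen by: {name}")
```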


ecilay commented Apr 23, 2024

Does the tool support pl.LightningModule out of the box?

@lix19937

Besides, I am trying to use the tool to quantize a latent encoder ... If I use the plain nn.Module, quantization works, but the exported ONNX produces quite wrong results.

What is your qat.py or calib.py?


ecilay commented Apr 23, 2024

This is all the code (#3808 (comment)), except that I swapped the model for the latent encoder model. Am I missing something?


ecilay commented Apr 23, 2024

Would you recommend that I try the latest AMMO from NVIDIA?
Is pytorch_quantization still being actively developed?

@lix19937

Would you recommend that I try the latest AMMO from NVIDIA? Is pytorch_quantization still being actively developed?

AMMO is a library for conveniently optimizing and deploying efficient neural networks that can fit a wide range of Nvidia hardware. It is intended for ML engineers to efficiently design, train, and deploy their models on Nvidia from within their desired ML training framework, e.g., PyTorch.

ref https://pypi.org/project/nvidia-ammo/

I think AMMO contains the pytorch-quantization functionality plus other new features that support LLM models (transformer-based). Both TensorRT-LLM and diffusion-model workflows use nvidia-ammo.

@zerollzeng
Collaborator

We encourage using AMMO now; pytorch-quantization will be deprecated in the future.

@zerollzeng zerollzeng self-assigned this Apr 25, 2024
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Apr 25, 2024
@ecilay ecilay closed this as completed May 1, 2024