
Example pytorch_quantization doesn't shown speed up #3808

Closed
ecilay opened this issue Apr 21, 2024 · 14 comments


ecilay commented Apr 21, 2024

Description

I have been following the documentation below to quantize a pretrained ResNet and get a feel for how it works. However, the quantized ResNet model is the same size as the original PyTorch model, and runtime and memory use are also unchanged. Is this expected?
https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/index.html

Environment

Built pytorch_quantization from source as instructed in the README.

TensorRT Version:

NVIDIA GPU:

NVIDIA Driver Version: 525.105

CUDA Version: 12.1

CUDNN Version:

Operating System:

Python Version (if applicable): 3.10

Tensorflow Version (if applicable):

PyTorch Version (if applicable): 2.1.0+cu121

Baremetal or Container (if so, version):

@ecilay changed the title from "Documented example doesn't shown speed up" to "Example pytorch_quantization doesn't shown speed up" on Apr 21, 2024

ecilay commented Apr 21, 2024

What's the difference between these two versions of ResNet?
Why doesn't using the model directly from PyTorch show a speedup?


ecilay commented Apr 21, 2024

I printed the model precision and found the model is actually still in fp32. This is the script I copied from the documentation; did I miss anything?

import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import calib
from pytorch_quantization.tensor_quant import QuantDescriptor

from torchvision.models import resnet50, ResNet50_Weights


total_batch = 512
num_tests = 5

device = torch.device("cuda:0")

quant_modules.initialize()
quant_desc_input = QuantDescriptor()
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)  # pretrained=True is deprecated in recent torchvision
model.cuda()
model.eval()
for param in model.parameters():
    print(param.dtype)

preprocess = weights.transforms()

def collect_stats(model, image_dir="quantize_images"):
    """Feed data to the network and collect statistics."""

    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    # uint8 input so the torchvision preprocessing transforms accept the tensor
    inputs = torch.randint(0, 255, (num_tests, total_batch, 3, 224, 224), dtype=torch.uint8)
    inputs = [preprocess(_input) for _input in inputs]
    for _input in inputs:
        _ = model(_input.cuda()).squeeze(0).softmax(0)

    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

def compute_amax(model, **kwargs):
    # Load calib result
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
            print(F"{name:40}: {module}")
    model.cuda()


with torch.no_grad():
    collect_stats(model)
    compute_amax(model, method="percentile", percentile=99.99)

torch.save(model.state_dict(), "quant.pt")
for param in model.parameters():
    print(param.dtype)


lix19937 commented Apr 22, 2024

torch.save(model.state_dict(), "quant.pt")
for param in model.parameters():
    print(param.dtype)

quant.pt has more scale layers than the non-quantized checkpoint. Alternatively, export an .onnx file and look at it in Netron; it will be much clearer.

param.dtype will always be fp32.
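
For reference, a minimal export sketch that makes the Q/DQ (scale) layers visible in Netron, assuming the calibrated `model` from the script above; the filename is hypothetical:

```python
import torch
from pytorch_quantization import nn as quant_nn

# Emit the TensorQuantizer modules as ONNX QuantizeLinear/DequantizeLinear nodes
quant_nn.TensorQuantizer.use_fb_fake_quant = True

dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(
    model,
    dummy_input,
    "quant_resnet50.onnx",   # hypothetical output path
    opset_version=13,        # opset >= 13 supports per-channel Q/DQ
    input_names=["input"],
    output_names=["output"],
)
```

Opening quant_resnet50.onnx in Netron should show QuantizeLinear/DequantizeLinear pairs in front of the convolutions, which a plain FP32 export does not have.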


ecilay commented Apr 22, 2024

Sorry, I don't think that quite answers my question...

Am I saving the model incorrectly? If so, how should I save the quantized model so that next time I can load the same modeling code directly with the new quantized checkpoint?


lix19937 commented Apr 22, 2024

What's the difference between these two versions of ResNet?

The quant version (with Q/DQ nodes inserted) vs. the non-quantized version.

Why doesn't using the model directly from PyTorch show a speedup?

The quant version adds more scale layers (mul ops). NVIDIA's pytorch_quantization tool only speeds the model up when it is run with TensorRT, not with PyTorch.

I think you should read more of the quantization docs.
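
In other words, the speedup only appears after the exported ONNX is compiled into a TensorRT engine. A minimal sketch, assuming the hypothetical quant_resnet50.onnx from the export step above (the same thing can be done with `trtexec --onnx=quant_resnet50.onnx --int8`):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is the standard flag for ONNX networks in TensorRT 8.x/9.x
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("quant_resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX file")

config = builder.create_builder_config()
# The Q/DQ nodes already carry the scales, so no calibrator is attached here.
config.set_flag(trt.BuilderFlag.INT8)

engine_bytes = builder.build_serialized_network(network, config)
with open("quant_resnet50.plan", "wb") as f:
    f.write(engine_bytes)
```

Benchmarking that engine (for example with `trtexec --loadEngine=quant_resnet50.plan`) against an FP32 engine of the same network is where the INT8 speedup should show up.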


ecilay commented Apr 22, 2024

Okay, good to know, but to be honest this is not explicitly stated in the documentation. If I missed it, I would appreciate you kindly pointing it out. Thanks.

only speeds the model up when it is run with TensorRT, not with PyTorch.


ecilay commented Apr 22, 2024

Besides, I am trying to use the tool to quantize a latent encoder; this is the model code: https://github.com/CompVis/latent-diffusion/blob/main/ldm/modules/diffusionmodules/model.py#L368. The inference code wraps this model in a pl.LightningModule.
When I try to quantize it, I get:

File "/home/lib/python3.10/site-packages/pytorch_quantization-2.2.0-py3.10-linux-x86_64.egg/pytorch_quantization/nn/modules/tensor_quantizer.py", line 260, in load_calib_amax
    raise RuntimeError(err_msg + " Passing 'strict=False' to `load_calib_amax()` will ignore the error.")
RuntimeError: Calibrator returned None. This usually happens when calibrator hasn't seen any tensor. Passing 'strict=False' to `load_calib_amax()` will ignore the error.
[2024-04-22 00:16:14,157] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-04-22 00:16:14,157] torch._dynamo.utils: [INFO] Function    Runtimes (s)
[2024-04-22 00:16:14,157] torch._dynamo.utils: [INFO] ----------  --------------

Would you have some clue why this happens?

If I use the plain nn.Module, quantization works, but the exported ONNX produces quite wrong results.
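
As the error message itself says, this usually means some TensorQuantizer modules never saw any tensors during the calibration forward passes. A rough, hypothetical diagnostic sketch, assuming `model` is the wrapped latent encoder and using the `strict=False` workaround the message suggests:

```python
from pytorch_quantization import nn as quant_nn

# Load whatever amax values were collected, skipping quantizers whose
# calibrator saw no data (the workaround suggested by the error message).
for name, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None:
            module.load_calib_amax(strict=False)

# Any quantizer still reporting amax=None was never reached by the
# calibration forward passes, e.g. its layer did not run on the calibration data.
for name, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        if module._calibrator is not None and module.amax is None:
            print(f"no calibration data seen by: {name}")
```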


ecilay commented Apr 23, 2024

Does the tool support pl.LightningModule out of the box?

@lix19937

Besides, I am trying to use the tool to quantize a latent encoder ... If I use the plain nn.Module, quantization works, but the exported ONNX produces quite wrong results.

What is your qat.py or calib.py?


ecilay commented Apr 23, 2024

This is all the code (#3808 (comment)), except that I swapped the model for the latent encoder model. Am I missing something?


ecilay commented Apr 23, 2024

Would you recommend that I try the latest AMMO from NVIDIA?
Is pytorch_quantization still being actively developed?

@lix19937

Would you recommend that I try the latest AMMO from NVIDIA? Is pytorch_quantization still being actively developed?

AMMO is a library for conveniently optimizing and deploying efficient neural networks that can fit a wide range of Nvidia hardware. It is intended for ML engineers to efficiently design, train, and deploy their models on Nvidia from within their desired ML training framework, e.g., PyTorch.

ref https://pypi.org/project/nvidia-ammo/

I think AMMO contains the pytorch-quantization functionality plus other new features that support LLM models (transformer-based). Both TensorRT-LLM and diffusion-model workflows use nvidia-ammo.

@zerollzeng
Collaborator

We encourage using AMMO now; pytorch-quantization will be deprecated in the future.

@zerollzeng zerollzeng self-assigned this Apr 25, 2024
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Apr 25, 2024
@ecilay ecilay closed this as completed May 1, 2024