
How to convert T5_v1.1_xxl from ONNX to TRT engine? #2167

Closed
TracelessLe opened this issue Jul 19, 2022 · 8 comments

Comments


TracelessLe commented Jul 19, 2022

Description

I am trying to use the sample script in TensorRT/demo/HuggingFace/notebooks/t5.ipynb to convert the google/t5-v1_1-xxl 11B model to ONNX format and then to a TRT engine file. The PyTorch->ONNX step works fine, but when I load the ONNX model and convert it to a TRT engine, it always fails after running for about 2 hours with the error below:

tensorrt_model_path:  ./models/google/t5-v1_1-xxl/tensorrt
[07/19/2022-11:52:45] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[07/19/2022-11:53:14] [TRT] [W] TensorRT was linked against cuBLAS/cuBLASLt 11.6.5 but loaded cuBLAS/cuBLASLt 11.5.1
[07/19/2022-13:36:09] [TRT] [E] 10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[(Unnamed Layer* 13) [Constant] + (Unnamed Layer* 14) [Shuffle]...Mul_1732]}.)
[07/19/2022-13:36:09] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
[!] Invalid Engine. Please ensure the engine was built correctly
Traceback (most recent call last):
  File "t5_onnx2trt.py", line 60, in <module>
    ).as_trt_engine(os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + ".engine", profiles=[encoder_profile])
  File "/root/TensorRT-8.2.5.1/demo/HuggingFace/NNDF/models.py", line 426, in as_trt_engine
    profiles,
  File "/root/TensorRT-8.2.5.1/demo/HuggingFace/T5/export.py", line 293, in onnx_to_trt
    return super().onnx_to_trt(output_fpath, input_fpath, network_metadata, profiles)
  File "/root/TensorRT-8.2.5.1/demo/HuggingFace/NNDF/models.py", line 129, in onnx_to_trt    network_definition, config=self.trt_inference_config
  File "<string>", line 3, in func_impl
  File "/root/miniconda3/envs/trt/lib/python3.7/site-packages/polygraphy/backend/base/loader.py", line 41, in __call__
    return self.call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/trt/lib/python3.7/site-packages/polygraphy/backend/trt/loader.py", line 645, in call_impl
    return engine_from_bytes(super().call_impl)
  File "<string>", line 3, in func_impl
  File "/root/miniconda3/envs/trt/lib/python3.7/site-packages/polygraphy/backend/base/loader.py", line 41, in __call__
    return self.call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/trt/lib/python3.7/site-packages/polygraphy/backend/trt/loader.py", line 669, in call_impl
    buffer, owns_buffer = util.invoke_if_callable(self._serialized_engine)
  File "/root/miniconda3/envs/trt/lib/python3.7/site-packages/polygraphy/util/util.py", line 646, in invoke_if_callable
    ret = func(*args, **kwargs)
  File "/root/miniconda3/envs/trt/lib/python3.7/site-packages/polygraphy/backend/trt/loader.py", line 603, in call_impl
    G_LOGGER.critical("Invalid Engine. Please ensure the engine was built correctly")
  File "/root/miniconda3/envs/trt/lib/python3.7/site-packages/polygraphy/logger/logger.py", line 349, in critical
    raise PolygraphyException(message) from None
polygraphy.exception.exception.PolygraphyException: Invalid Engine. Please ensure the engine was built correctly

The code I use is below:

batch_size = 1
T5_VARIANT = 'google/t5-v1_1-xxl'
max_sequence_length = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[T5_VARIANT]

# Encoder optimization profiles
encoder_profile = Profile()
encoder_profile.add(
    "input_ids",
    min=(batch_size, 1),
    opt=(batch_size, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)

encoder_onnx_model_fpath = "t5-xxl-encoder.onnx"
metadata=NetworkMetadata(variant=T5_VARIANT, precision=Precision(fp16=True), other=T5Metadata(kv_cache=False))
t5_trt_encoder_engine = T5EncoderONNXFile(
                os.path.join(onnx_model_path, encoder_onnx_model_fpath), metadata
            ).as_trt_engine(os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + ".engine", profiles=[encoder_profile])
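
For reference, here is a minimal standalone sketch of the same encoder build done directly with Polygraphy's TensorRT backend (which is what as_trt_engine wraps underneath). The paths, sequence lengths, and workspace size are illustrative assumptions, not values taken from the demo:

# Illustrative Polygraphy-only sketch of the encoder engine build (paths and sizes are assumptions).
from polygraphy.backend.trt import (
    CreateConfig,
    Profile,
    engine_from_network,
    network_from_onnx_path,
    save_engine,
)

onnx_fpath = "./models/google/t5-v1_1-xxl/onnx/t5-xxl-encoder.onnx"          # hypothetical path
engine_fpath = "./models/google/t5-v1_1-xxl/tensorrt/t5-xxl-encoder.onnx.engine"

# Dynamic-shape profile for the encoder input (sequence lengths assumed here).
profile = Profile().add(
    "input_ids",
    min=(1, 1),
    opt=(1, 256),
    max=(1, 512),
)

config = CreateConfig(
    tf32=True,
    fp16=False,                                  # FP32/TF32 build; see the FP16 NaN discussion below
    max_workspace_size=30 * 1024 * 1024 * 1024,  # ~30 GB builder workspace
    profiles=[profile],
)

engine = engine_from_network(network_from_onnx_path(onnx_fpath), config=config)
save_engine(engine, engine_fpath)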

P.S. I used the same Jupyter notebook to convert t5-small, t5-large and t5-3b with no problem; it only fails when I move to t5-v1.1-xxl... :(

Environment

TensorRT Version: 8.2.5.1
NVIDIA GPU: Tested on A100 and 3090Ti
NVIDIA Driver Version: 470.57.02
CUDA Version: 11.4
CUDNN Version: 8.2
Operating System: Ubuntu 18.04
Python Version (if applicable): 3.7
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.11.0+cu113
Baremetal or Container (if so, version):

Relevant Files

I found some similar issues, such as #1686, #1937 and #1917, and tried increasing the workspace, but it had no effect.

I enabled TRT verbose logging; the output from the run is attached below:

t5xxl_trt_log.txt

Steps To Reproduce


nrakltx commented Nov 7, 2022

@TracelessLe Could you please share how you fixed this? I would really appreciate it!

TracelessLe (Author) commented:

@TracelessLe Could you please share how you fixed this? I would really appreciate it!

Hi @nrakltx, I just increased the workspace in the TRT config (in the NNDF/models.py script) to 10 times the base value as below, and it succeeded:

From:

DEFAULT_TRT_WORKSPACE_MB = 3072

self.trt_inference_config = CreateConfig(
    tf32=True,
    fp16=network_metadata.precision.fp16,
    max_workspace_size=result.DEFAULT_TRT_WORKSPACE_MB * 1024 * 1024,
    profiles=profiles,
    obey_precision_constraints=result.use_obey_precision_constraints()
)

To:

DEFAULT_TRT_WORKSPACE_MB = 3072

self.trt_inference_config = CreateConfig(
    tf32=True,
    fp16=network_metadata.precision.fp16,
    max_workspace_size=result.DEFAULT_TRT_WORKSPACE_MB * 10 * 1024 * 1024,
    profiles=profiles,
    obey_precision_constraints=result.use_obey_precision_constraints()
)
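
For what it's worth, 3072 MB x 10 = 30720 MB, i.e. roughly the 30 GB of builder workspace discussed further down in this thread. On newer TensorRT/Polygraphy releases max_workspace_size is deprecated; a hedged equivalent (assuming a recent Polygraphy with memory-pool support) would look like:

# Sketch for newer Polygraphy/TensorRT, where the workspace is set via memory pool limits.
import tensorrt as trt
from polygraphy.backend.trt import CreateConfig

config = CreateConfig(
    tf32=True,
    fp16=network_metadata.precision.fp16,
    memory_pool_limits={trt.MemoryPoolType.WORKSPACE: 30 * 1024 * 1024 * 1024},  # ~30 GB
    profiles=profiles,  # same optimization profiles as above
)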


nrakltx commented Nov 8, 2022

This is with FP32 and not FP16, correct?

TracelessLe (Author) commented:

This is with FP32 and not FP16, correct?

Yes. Some NaN errors may occur when using FP16 with the T5 XXL model, as mentioned in:

  1. transformers issues
  2. huggingface discuss

You can give it a try. :)
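
If you want to check whether a given checkpoint is affected before building the engine, here is a minimal sketch that runs one FP16 forward pass in PyTorch and looks for NaNs (the prompt and decoder start token are illustrative; loading the FP16 XXL weights needs roughly 22 GB of GPU memory):

# Illustrative FP16 NaN check for google/t5-v1_1-xxl (assumes enough GPU memory for the FP16 weights).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
model = T5ForConditionalGeneration.from_pretrained(
    "google/t5-v1_1-xxl", torch_dtype=torch.float16
).cuda().eval()

inputs = tokenizer("translate English to German: Hello world", return_tensors="pt").to("cuda")
decoder_input_ids = torch.zeros((1, 1), dtype=torch.long, device="cuda")  # decoder start token (pad id 0)

with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits

print("NaNs in logits:", torch.isnan(logits).any().item())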


nrakltx commented Nov 8, 2022

Cool, so 30GB VRAM was enough for the FP32 T5 v1.1 XXL TensorRT engine building process?


drxmy commented Jan 3, 2023


Did you use an 80GB or a 40GB A100? I tried increasing DEFAULT_TRT_WORKSPACE_MB, but it gave an "OutOfMemory" message on both a 32GB V100 and a 40GB A100.


nrakltx commented Jan 3, 2023

80GB; 40GB is not enough. My average VRAM usage was around 45GB during compilation.
Note that if you have access to both versions of the GPU, you can build the engine on the 80GB card and run inference on the 40GB card.
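
Once the engine has been built on the larger card, loading it for inference is cheap in comparison. A minimal sketch with Polygraphy (the engine path, input shape, and dummy data are illustrative assumptions):

# Illustrative: deserialize a pre-built engine and run the encoder with Polygraphy.
import numpy as np
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

engine = EngineFromBytes(BytesFromPath("t5-xxl-encoder.onnx.engine"))  # hypothetical path

with TrtRunner(engine) as runner:
    input_ids = np.zeros((1, 128), dtype=np.int32)  # dummy token ids
    outputs = runner.infer(feed_dict={"input_ids": input_ids})
    print({name: arr.shape for name, arr in outputs.items()})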


drxmy commented Jan 3, 2023

80GB; 40GB is not enough. My average VRAM usage was around 45GB during compilation. Note that if you have access to both versions of the GPU, you can build the engine on the 80GB card and run inference on the 40GB card.

Thank you!
