TensorRT-LLM 0.10.0 Release #1735

kaiyux · 2024-06-05T13:02:35Z

kaiyux
Jun 5, 2024
Maintainer

Hi,

We are very pleased to announce the 0.10.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

Key Features and Enhancements

The Python high-level API
- Added embedding parallel, embedding sharing, and fused MLP support.
- Enabled the usage of the executor API.
Added a weight-stripping feature with a new trtllm-refit command. For more information, refer to examples/sample_weight_stripping/README.md.
Added a weight-streaming feature. For more information, refer to docs/source/advanced/weight-streaming.md.
Enhanced the multiple profiles feature; the --multiple_profiles argument of the trtllm-build command now builds more optimization profiles for better performance.
Added FP8 quantization support for Mixtral.
Added support for pipeline parallelism for GPT.
Optimized applyBiasRopeUpdateKVCache kernel by avoiding re-computation.
Reduced overheads between enqueue calls of TensorRT engines.
Added support for paged KV cache for enc-dec models. The support is limited to beam width 1.
Added W4A(fp)8 CUTLASS kernels for the NVIDIA Ada Lovelace architecture.
Added debug options (--visualize_network and --dry_run) to the trtllm-build command to visualize the TensorRT network before engine build.
Integrated the new NVIDIA Hopper XQA kernels for LLaMA 2 70B model.
Improved the performance of pipeline parallelism when enabling in-flight batching.
Supported quantization for Nemotron models.
Added LoRA support for Qwen.
Added in-flight batching support for ChatGLM models.
Added support to ModelRunnerCpp so that it runs with the executor API for IFB-compatible models.
Enhanced the custom AllReduce by adding a heuristic that falls back to the native NCCL kernel when the hardware requirements are not satisfied, to get the best performance.
Optimized the performance of the checkpoint conversion process for LLaMA.
Benchmark
- [BREAKING CHANGE] Moved the request rate generation arguments and logic from the prepare dataset script to gptManagerBenchmark.
- Enabled streaming and added support for Time To First Token (TTFT) latency and Inter-Token Latency (ITL) metrics in gptManagerBenchmark.
- Added the --max_attention_window option to gptManagerBenchmark.
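
To make the two streaming metrics named in the Benchmark items concrete, here is a minimal sketch of how TTFT and ITL are commonly derived from per-token arrival times. The function name `ttft_and_itl` and its inputs are hypothetical and only for illustration; gptManagerBenchmark may compute or aggregate these metrics differently (for example, across many requests).

```python
from statistics import mean


def ttft_and_itl(request_start, token_times):
    """Return (TTFT, mean ITL) in the same time unit as the inputs.

    request_start: time at which the request was submitted.
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # TTFT: delay between submitting the request and receiving the first token.
    ttft = token_times[0] - request_start
    # ITL: average gap between consecutive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl


# Timestamps in milliseconds for one streamed request (made-up numbers).
ttft_ms, itl_ms = ttft_and_itl(0, [250, 300, 350, 400])
print(f"TTFT = {ttft_ms} ms, ITL = {itl_ms} ms")
```

A single-token response has no inter-token gap, so this sketch reports an ITL of 0.0 in that case; that is a choice made for this example, not a documented behavior of the benchmark.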

API Changes

[BREAKING CHANGE] Set the default tokens_per_block argument of the trtllm-build command to 64 for better performance.
[BREAKING CHANGE] Migrated enc-dec models to the unified workflow.
[BREAKING CHANGE] Renamed GptModelConfig to ModelConfig.
[BREAKING CHANGE] Added speculative decoding mode to the builder API.
[BREAKING CHANGE] Refactored scheduling configurations.
- Unified the SchedulerPolicy with the same name in batch_scheduler and executor, and renamed it to CapacitySchedulerPolicy.
- Expanded the existing configuration scheduling strategy from SchedulerPolicy to SchedulerConfig to enhance extensibility. The latter also introduces a chunk-based configuration called ContextChunkingPolicy.
[BREAKING CHANGE] Removed the input prompt from the generation output of the generate() and generate_async() APIs. For example, given the prompt A B, the generation result was previously <s>A B C D E, where only C D E is the actual output; the result is now C D E.
[BREAKING CHANGE] Switched default add_special_token in the TensorRT-LLM backend to True.
Deprecated GptSession and TrtGptModelV1.
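
For callers affected by the generate() / generate_async() output change listed above, the following is a minimal sketch of a compatibility helper that reduces an old-style result (prompt echoed, optionally preceded by a begin-of-sequence token) to only the newly generated tokens. `strip_prompt` is a hypothetical helper written for this note and is not part of TensorRT-LLM; the tokens mirror the A B / C D E example above.

```python
def strip_prompt(prompt, output, bos=None):
    """Return only the newly generated tokens.

    prompt: tokens that were sent as input.
    output: tokens returned by generation (old or new style).
    bos:    optional begin-of-sequence token that may precede the echoed prompt.
    """
    out = list(output)
    prompt = list(prompt)
    # Skip a leading BOS token, if one is present, before comparing with the prompt.
    body = out[1:] if bos is not None and out[:1] == [bos] else out
    if prompt and body[:len(prompt)] == prompt:
        # Old-style output: the prompt is echoed, so drop it.
        return body[len(prompt):]
    # New-style output: nothing to strip.
    return out


old_style = ["<s>", "A", "B", "C", "D", "E"]
new_style = ["C", "D", "E"]
print(strip_prompt(["A", "B"], old_style, bos="<s>"))
print(strip_prompt(["A", "B"], new_style, bos="<s>"))
```

Note that a prefix comparison cannot tell an echoed prompt apart from a generation that happens to begin with the same tokens as the prompt, so a helper like this is only suitable as a transitional shim.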

Model Updates

Support DBRX
Support Qwen2
Support CogVLM
Support ByT5
Support LLaMA 3
Support Arctic (w/ FP8)
Support Fuyu
Support Persimmon
Support Deplot
Support Phi-3-Mini with long RoPE
Support Neva
Support Kosmos-2
Support RecurrentGemma

Fixed Issues

Fixed some unexpected behaviors in beam search and early stopping, so that the outputs are more accurate.
Fixed segmentation fault with pipeline parallelism and gather_all_token_logits. (Segmentation fault with pipeline parallelism and gather_all_token_logits #1284)
Removed the unnecessary check in XQA to fix Code Llama 70B crashes on Triton. (Code Llama 70b triton crashes with XQA #1256)
Fixed an unsupported ScalarType issue for BF16 LoRA. (Support bfloat16 LoRa Adaptors triton-inference-server/tensorrtllm_backend#403)
Eliminated the load and save of prompt table in multimodal. (why is the `prompt_table` in ModelRunner.generate passed in as npy file instead of a tensor ? #1436)
Fixed an error when converting the model weights of Qwen 72B INT4-GPTQ. (Qwen-72B-chat-GPTQ TP=4 ERROR #1344)
Fixed early stopping and failures on in-flight batching cases of Medusa. (Fail to run Medusa IFB with triton inference server #1449)
Added support for more NVLink versions for auto parallelism. (KeyError: 6 when getting nvlink_bandwidth #1467)
Fixed the assert failure caused by default values of sampling config. ([TensorRT-LLM][ERROR] Assertion failed: hasValues == configValue.has_value() (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/samplingConfig.h:46) #1447)
Fixed a requirement specification on Windows for nvidia-cudnn-cu12. (ImportError: DLL load failed while importing tensorrt #1446)
Fixed MMHA relative position calculation error in gpt_attention_plugin for enc-dec models. (Flan t5 xxl result large difference #1343)

Infrastructure changes

Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.03-py3.
Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.03-py3.
The dependent TensorRT version is updated to 10.0.1.
The dependent CUDA version is updated to 12.4.0.
The dependent PyTorch version is updated to 2.2.2.
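
As a rough aid when upgrading, the dependency versions listed above can be compared against an environment with a small check like the one below. This is only a sketch: the keys (`tensorrt`, `cuda`, `torch`) and the `find_mismatches` helper are hypothetical labels chosen for this example, and how the installed versions are collected is left to the caller.

```python
# Versions taken from the infrastructure notes above; key names are illustrative.
EXPECTED = {"tensorrt": "10.0.1", "cuda": "12.4.0", "torch": "2.2.2"}


def find_mismatches(installed, expected=EXPECTED):
    """Return {name: (expected, found)} for every component that differs.

    installed: mapping of component name to the version string found locally.
    A local build suffix after '+' (for example '2.2.2+cu121') is ignored.
    """
    mismatches = {}
    for name, want in expected.items():
        found = installed.get(name)
        base = found.split("+", 1)[0] if found else None
        if base != want:
            mismatches[name] = (want, found)
    return mismatches


# Made-up environment: TensorRT is older than expected and PyTorch is missing.
print(find_mismatches({"tensorrt": "9.3.0", "cuda": "12.4.0"}))
```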

Currently, there are two key branches in the project:

The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
The main branch is the dev branch. It is more experimental.

We are updating the main branch regularly with new features, bug fixes, and performance optimizations. The rel branch will be updated less frequently, and the exact frequency depends on your feedback.

Thanks,
The TensorRT-LLM Engineering Team

This discussion was created from the release TensorRT-LLM 0.10.0 Release.
