TensorRT-LLM 0.10.0 Release
Hi,
We are very pleased to announce the 0.10.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
Key Features and Enhancements
- The Python high level API
  - Added embedding parallel, embedding sharing, and fused MLP support.
  - Enabled the usage of the executor API.
- Added a weight-stripping feature with a new trtllm-refit command. For more information, refer to examples/sample_weight_stripping/README.md.
- Added a weight-streaming feature. For more information, refer to docs/source/advanced/weight-streaming.md.
- Enhanced the multiple profiles feature; the --multiple_profiles argument in the trtllm-build command now builds more optimization profiles for better performance.
- Added FP8 quantization support for Mixtral.
- Added support for pipeline parallelism for GPT.
- Optimized the applyBiasRopeUpdateKVCache kernel by avoiding re-computation.
- Reduced overheads between enqueue calls of TensorRT engines.
- Added support for paged KV cache for enc-dec models. The support is limited to beam width 1.
- Added W4A(fp)8 CUTLASS kernels for the NVIDIA Ada Lovelace architecture.
- Added debug options (--visualize_network and --dry_run) to the trtllm-build command to visualize the TensorRT network before engine build.
- Integrated the new NVIDIA Hopper XQA kernels for the LLaMA 2 70B model.
- Improved the performance of pipeline parallelism when enabling in-flight batching.
- Supported quantization for Nemotron models.
- Added LoRA support for Mixtral and Qwen.
- Added in-flight batching support for ChatGLM models.
- Added support to ModelRunnerCpp so that it runs with the executor API for IFB-compatible models.
- Enhanced the custom AllReduce by adding a heuristic; it falls back to the native NCCL kernel when hardware requirements are not satisfied, to get the best performance.
- Optimized the performance of the checkpoint conversion process for LLaMA.
- Benchmark
  - [BREAKING CHANGE] Moved the request rate generation arguments and logic from the prepare dataset script to gptManagerBenchmark.
  - Enabled streaming and added support for Time To the First Token (TTFT) latency and Inter-Token Latency (ITL) metrics in gptManagerBenchmark.
  - Added the --max_attention_window option to gptManagerBenchmark.
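To make the two streaming metrics named in the Benchmark items concrete, here is a minimal sketch of how TTFT and ITL are commonly derived from token arrival times. It is not taken from gptManagerBenchmark; the function name `streaming_latency_metrics`, the millisecond timestamps, and the choice to report the mean gap are all assumptions for illustration, and the benchmark's own computation may differ.

```python
from statistics import mean


def streaming_latency_metrics(request_start_ms, token_times_ms):
    """Return (ttft_ms, mean_itl_ms) for one streamed request.

    Hypothetical helper, not part of TensorRT-LLM.
    request_start_ms: time the request was submitted.
    token_times_ms: arrival time of each output token, in order.
    """
    if not token_times_ms:
        raise ValueError("at least one output token is required")
    # TTFT: delay between submitting the request and receiving the first token.
    ttft_ms = token_times_ms[0] - request_start_ms
    # ITL: gaps between consecutive tokens; reported here as their mean.
    gaps = [later - earlier
            for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    mean_itl_ms = mean(gaps) if gaps else 0
    return ttft_ms, mean_itl_ms


# Made-up timestamps: first token after 250 ms, then gaps of 50, 50 and 80 ms.
ttft, itl = streaming_latency_metrics(0, [250, 300, 350, 430])
print(ttft, itl)
```

Both metrics only make sense when tokens are streamed back one by one, which may be why they arrive together with streaming support.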
API Changes
- [BREAKING CHANGE] Set the default tokens_per_block argument of the trtllm-build command to 64 for better performance.
- [BREAKING CHANGE] Migrated enc-dec models to the unified workflow.
- [BREAKING CHANGE] Renamed GptModelConfig to ModelConfig.
- [BREAKING CHANGE] Added speculative decoding mode to the builder API.
- [BREAKING CHANGE] Refactored scheduling configurations
  - Unified the SchedulerPolicy with the same name in batch_scheduler and executor, and renamed it to CapacitySchedulerPolicy.
  - Expanded the existing configuration scheduling strategy from SchedulerPolicy to SchedulerConfig to enhance extensibility. The latter also introduces a chunk-based configuration called ContextChunkingPolicy.
- [BREAKING CHANGE] The input prompt was removed from the generation output in the generate() and generate_async() APIs. For example, given the prompt "A B", the original generation result could be "<s>A B C D E", where only "C D E" is the actual output; now the result is "C D E".
- [BREAKING CHANGE] Switched the default add_special_token in the TensorRT-LLM backend to True.
- Deprecated GptSession and TrtGptModelV1.
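For callers that relied on the prompt being echoed back by generate() or generate_async(), the change above means the prompt now has to be re-attached explicitly. The sketch below works on plain strings only and does not call TensorRT-LLM; the helper name `full_text` and the single-space separator are assumptions for illustration.

```python
def full_text(prompt: str, generation: str, sep: str = " ") -> str:
    """Re-attach the prompt to a generation result that no longer includes it.

    Hypothetical helper, not part of TensorRT-LLM. The separator is an
    assumption; the right joining rule depends on the tokenizer in use.
    """
    return f"{prompt}{sep}{generation}"


prompt = "A B"
generation = "C D E"  # the result shape described above for 0.10.0
print(full_text(prompt, generation))
```

Code that previously sliced the prompt off the front of the result should drop that step, since applying it to the new output would remove generated text. The sketch also does not restore special tokens such as the leading <s> from the older output.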
Model Updates
- Support DBRX
- Support Qwen2
- Support CogVLM
- Support ByT5
- Support LLaMA 3
- Support Arctic (w/ FP8)
- Support Fuyu
- Support Persimmon
- Support Deplot
- Support Phi-3-Mini with long Rope
- Support Neva
- Support Kosmos-2
- Support RecurrentGemma
Fixed Issues
- Fixed some unexpected behaviors in beam search and early stopping, so that the outputs are more accurate.
- Fixed a segmentation fault with pipeline parallelism and gather_all_token_logits. (#1284)
- Removed the unnecessary check in XQA to fix Code Llama 70B Triton crashes. (#1256)
- Fixed an unsupported ScalarType issue for BF16 LoRA. (triton-inference-server/tensorrtllm_backend#403)
- Eliminated the load and save of prompt table in multimodal. (#1436)
- Fixed an error when converting the model weights of Qwen 72B INT4-GPTQ. (#1344)
- Fixed early stopping and failures on in-flight batching cases of Medusa. (#1449)
- Added support for more NVLink versions for auto parallelism. (#1467)
- Fixed the assert failure caused by default values of sampling config. (#1447)
- Fixed a requirement specification on Windows for nvidia-cudnn-cu12. (#1446)
- Fixed MMHA relative position calculation error in gpt_attention_plugin for enc-dec models. (#1343)
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.03-py3.
- Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.03-py3.
- The dependent TensorRT version is updated to 10.0.1.
- The dependent CUDA version is updated to 12.4.0.
- The dependent PyTorch version is updated to 2.2.2.
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team