Releases: NVIDIA/TensorRT-LLM
TensorRT-LLM 0.14.0 Release
Hi,
We are very pleased to announce the 0.14.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Enhanced the `LLM` class in the LLM API (see the sketch after this list).
  - Added support for calibration with an offline dataset.
  - Added support for Mamba2.
  - Added support for `finish_reason` and `stop_reason`.
- Added FP8 support for CodeLlama.
- Added a `__repr__` method to the `Module` class, thanks to the contribution from @1ytic in #2191.
- Added BFloat16 support for fused gated MLP.
- Updated the ReDrafter beam search logic to match Apple ReDrafter v1.1.
- Improved `customAllReduce` performance.
- The draft model can now copy logits directly over MPI to the target model's process in `orchestrator` mode. This fast logits copy reduces the delay between draft token generation and the beginning of target model inference.
- NVIDIA Volta GPU support is deprecated and will be removed in a future release.
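The `finish_reason` / `stop_reason` addition is easiest to see through the LLM API. Below is a minimal, hedged sketch; the import path, model id, sampling parameter names, and the exact attribute placement on the output objects are assumptions for illustration, not taken from these notes.

```python
# Hedged sketch: inspect finish_reason / stop_reason on a generation result.
# Model id, SamplingParams fields, and output attribute names are assumptions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # hypothetical Hugging Face model id
params = SamplingParams(max_tokens=64)  # assumed field name for the generation length cap

for request_output in llm.generate(["What does TensorRT-LLM do?"], params):
    completion = request_output.outputs[0]
    print(completion.text)
    # Added in this release; whether they live on the per-sequence output is assumed here.
    print(completion.finish_reason, completion.stop_reason)
```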
API Changes
- [BREAKING CHANGE] The default `max_batch_size` of the `trtllm-build` command is set to `2048`.
- [BREAKING CHANGE] Removed `builder_opt` from the `BuildConfig` class and the `trtllm-build` command (see the sketch after this list).
- Added logits post-processor support to the `ModelRunnerCpp` class.
- Added an `isParticipant` method to the C++ `Executor` API to check if the current process is a participant in the executor instance.
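If the new `trtllm-build` defaults do not fit your deployment, the equivalent knobs can also be set when building through the Python API. A hedged sketch follows; aside from `max_batch_size` and the removed `builder_opt` (both named above), the keyword names and paths are assumptions.

```python
# Hedged sketch: build with an explicit max_batch_size instead of the new 2048 default.
# Checkpoint path, save location, and the build_config keyword are assumptions.
from tensorrt_llm import LLM, BuildConfig

# builder_opt is no longer a BuildConfig field in this release; passing it would fail.
build_config = BuildConfig(max_batch_size=256)

llm = LLM(model="/path/to/hf_model_or_trtllm_checkpoint", build_config=build_config)
llm.save("/path/to/engine_dir")  # persist the built engine for later runs
```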
Model Updates
- Added support for NemotronNas, see `examples/nemotron_nas/README.md`.
- Added support for Deepseek-v1, see `examples/deepseek_v1/README.md`.
- Added support for Phi-3.5 models, see `examples/phi/README.md`.
Fixed Issues
- Fixed a typo in `tensorrt_llm/models/model_weights_loader.py`, thanks to the contribution from @wangkuiyi in #2152.
- Fixed a duplicated module import in `tensorrt_llm/runtime/generation.py`, thanks to the contribution from @lkm2835 in #2182.
- Enabled `share_embedding` for models that have no `lm_head` in the legacy checkpoint conversion path, thanks to the contribution from @lkm2835 in #2232.
- Fixed a `kv_cache_type` issue in the Python benchmark, thanks to the contribution from @qingquansong in #2219.
- Fixed an issue with SmoothQuant calibration with custom datasets, thanks to the contribution from @Bhuvanesh09 in #2243.
- Fixed an issue with `trtllm-build --fast-build` when using fake or random weights, thanks to @ZJLi2013 for flagging it in #2135.
- Fixed missing `use_fused_mlp` when constructing `BuildConfig` from a dict (see the sketch after this list), thanks for the fix from @ethnzhng in #2081.
- Fixed the lookahead batch layout for `numNewTokensCumSum`. (#2263)
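The `use_fused_mlp` fix concerns the dict round-trip of `BuildConfig`. A hedged illustration is below; the `to_dict`/`from_dict` pair and the exact location of the `use_fused_mlp` field are assumptions about what the fix covers.

```python
# Hedged sketch: a non-default use_fused_mlp should survive a dict round-trip after the fix.
# Field placement and the to_dict/from_dict helpers are assumptions.
from tensorrt_llm import BuildConfig

cfg = BuildConfig(max_batch_size=8, max_seq_len=2048)
cfg.use_fused_mlp = False  # non-default value that previously could be dropped

restored = BuildConfig.from_dict(cfg.to_dict())
assert restored.use_fused_mlp == cfg.use_fused_mlp
```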
Infrastructure Changes
- The dependent ModelOpt version is updated to v0.17.
Documentation
- @Sherlock113 added a tech blog to the latest news in #2169, thanks for the contribution.
Known Issues
- Replit Code is not supported with transformers 4.45+.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.13.0 Release
Hi,
We are very pleased to announce the 0.13.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported lookahead decoding (experimental), see `docs/source/speculative_decoding.md`.
- Added some enhancements to the `ModelWeightsLoader` (a unified checkpoint converter, see `docs/source/architecture/model-weights-loader.md`).
  - Supported Qwen models.
  - Supported auto-padding for indivisible TP shapes in INT4-wo/INT8-wo/INT4-GPTQ.
  - Improved performance on `*.bin` and `*.pth`.
- Supported OpenAI Whisper in the C++ runtime.
- Added some enhancements to the `LLM` class (see the sketch after this list).
  - Supported LoRA.
  - Supported engine building using dummy weights.
  - Supported `trust_remote_code` for customized models and tokenizers downloaded from the Hugging Face Hub.
- Supported beam search for streaming mode.
- Supported tensor parallelism for Mamba2.
- Supported returning generation logits for streaming mode.
- Added `curand` and `bfloat16` support for `ReDrafter`.
- Added a sparse mixer normalization mode for MoE models.
- Added support for QKV scaling in FP8 FMHA.
- Supported FP8 for MoE LoRA.
- Supported KV cache reuse for P-Tuning and LoRA.
- Supported in-flight batching for CogVLM models.
- Supported LoRA for the `ModelRunnerCpp` class.
- Supported `head_size=48` cases for FMHA kernels.
- Added FP8 examples for DiT models, see `examples/dit/README.md`.
- Supported decoder with encoder input features for the C++ `executor` API.
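For the `LLM`-class enhancements, a hedged usage sketch is shown below. Only `trust_remote_code` is named in these notes; the import path, repository id, and the rest of the call are assumptions.

```python
# Hedged sketch: load a community checkpoint whose repository ships custom model/tokenizer code.
# The repository id is a placeholder; trust_remote_code is the option named in the notes.
from tensorrt_llm import LLM

llm = LLM(model="some-org/model-with-custom-code",  # hypothetical Hugging Face repository
          trust_remote_code=True)                   # allow the repo's custom Python code to run

for request_output in llm.generate(["Hello, world"]):
    print(request_output.outputs[0].text)
```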
API Changes
- [BREAKING CHANGE] Set `use_fused_mlp` to `True` by default.
- [BREAKING CHANGE] Enabled `multi_block_mode` by default.
- [BREAKING CHANGE] Enabled `strongly_typed` by default in the `builder` API.
- [BREAKING CHANGE] Renamed `maxNewTokens`, `randomSeed` and `minLength` to `maxTokens`, `seed` and `minTokens` following OpenAI style.
- The `LLM` class
  - [BREAKING CHANGE] Updated `LLM.generate` arguments to include `PromptInputs` and `tqdm`.
- The C++ `executor` API
  - [BREAKING CHANGE] Added `LogitsPostProcessorConfig`.
  - Added `FinishReason` to `Result`.
Model Updates
- Supported Gemma 2, see the "Run Gemma 2" section in `examples/gemma/README.md`.
Fixed Issues
- Fixed an accuracy issue when enabling remove padding for cross attention. (#1999)
- Fixed a failure when converting qwen2-0.5b-instruct with `smoothquant`. (#2087)
- Matched the `exclude_modules` pattern in `convert_utils.py` to the changes in `quantize.py`. (#2113)
- Fixed a build engine error when `FORCE_NCCL_ALL_REDUCE_STRATEGY` is set.
- Fixed unexpected truncation in the quant mode of `gpt_attention`.
- Fixed a hang caused by a race condition when canceling requests.
- Fixed the default factory for `LoraConfig`. (#1323)
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
. - The dependent TensorRT version is updated to 10.4.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.12.0 Release
Hi,
We are very pleased to announce the 0.12.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported LoRA for MoE models.
- The `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- Supported FP8 FMHA for the NVIDIA Ada Lovelace architecture.
- Supported GPT-J, Phi, Phi-3, Qwen, GPT, GLM, Baichuan, Falcon and Gemma models for the `LLM` class.
- Supported FP8 OOTB MoE.
- Supported Starcoder2 SmoothQuant. (#1886)
- Supported ReDrafter speculative decoding, see the "ReDrafter" section in `docs/source/speculative_decoding.md`.
- Supported padding removal for BERT, thanks to the contribution from @Altair-Alpha in #1834.
- Added in-flight batching support for the GLM 10B model.
- Supported the `gelu_pytorch_tanh` activation function, thanks to the contribution from @ttim in #1897.
- Added a `chunk_length` parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added a `concurrency` argument for `gptManagerBenchmark`.
- The Executor API supports requests with different beam widths, see `docs/source/executor.md#sending-requests-with-different-beam-widths`.
- Added the flag `--fast_build` to the `trtllm-build` command (experimental).
API Changes
- [BREAKING CHANGE] `max_output_len` is removed from the `trtllm-build` command; if you want to limit the sequence length at the engine build stage, specify `max_seq_len` instead (see the sketch after this list).
- [BREAKING CHANGE] The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- [BREAKING CHANGE] The `multi_block_mode` argument is moved from the build stage (`trtllm-build` and the builder API) to the runtime.
- [BREAKING CHANGE] The build-time argument `context_fmha_fp32_acc` is moved to the runtime for decoder models.
- [BREAKING CHANGE] The `tp_size`, `pp_size` and `cp_size` arguments are removed from the `trtllm-build` command.
- The C++ batch manager API is deprecated in favor of the C++ `executor` API, and it will be removed in a future release of TensorRT-LLM.
- Added a version API to the C++ library; a `cpp/include/tensorrt_llm/executor/version.h` file is now generated.
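With `max_output_len` gone, the overall sequence budget is expressed through `max_seq_len`. A hedged sketch through the Python build configuration is below; only `max_seq_len` comes from these notes, while the surrounding keywords and paths are assumptions.

```python
# Hedged sketch: cap input + output length at build time via max_seq_len.
# The checkpoint path and the build_config keyword are placeholders/assumptions.
from tensorrt_llm import LLM, BuildConfig

build_config = BuildConfig(max_seq_len=4096)  # bounds prompt + generated tokens per sequence

llm = LLM(model="/path/to/hf_model_or_trtllm_checkpoint", build_config=build_config)
```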
Model Updates
- Supported LLaMA 3.1 model.
- Supported Mamba-2 model.
- Supported EXAONE model, see `examples/exaone/README.md`.
- Supported Qwen 2 model.
- Supported GLM4 models, see `examples/chatglm/README.md`.
- Added LLaVa-1.6 (LLaVa-NeXT) multimodal support, see the "LLaVA, LLaVa-NeXT and VILA" section in `examples/multimodal/README.md`.
Fixed Issues
- Fixed wrong pad token for the CodeQwen models. (#1953)
- Fixed a typo in `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from @saeyoonoh in #1987.
- Removed duplicated flags in the commands at `docs/source/reference/troubleshooting.md`, thanks for the contribution from @hattizai in #1937.
- Fixed a segmentation fault in the TopP sampling layer, thanks to the contribution from @akhoroshev in #2039. (#2040)
- Fixed a failure when converting the checkpoint for the Mistral Nemo model. (#1985)
- Propagated `exclude_modules` to weight-only quantization, thanks to the contribution from @fjosw in #2056.
- Fixed wrong links in the README, thanks to the contribution from @Tayef-Shah in #2028.
- Fixed some typos in the documentation, thanks to the contribution from @lfz941 in #1939.
- Fixed an engine build failure when the deduced `max_seq_len` is not an integer. (#2018)
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
. - The dependent TensorRT version is updated to 10.3.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.0.
Known Issues
- On Windows, installation of TensorRT-LLM may succeed, but you might hit `OSError: exception: access violation reading 0x0000000000000000` when importing the library in Python. See Installing on Windows for workarounds.
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.11.0 Release
Hi,
We are very pleased to announce the 0.11.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported very long context for LLaMA (see the “Long context evaluation” section in `examples/llama/README.md`).
- Low latency optimization
  - Added a reduce-norm feature which aims to fuse the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel. This is recommended when the batch size is small and the generation phase time is dominant.
  - Added FP8 support to the GEMM plugin, which benefits cases when the batch size is smaller than 4.
  - Added a fused GEMM-SwiGLU plugin for FP8 on SM90.
- LoRA enhancements
  - Supported running FP8 LLaMA with FP16 LoRA checkpoints.
  - Added support for quantized base models with FP16/BF16 LoRA.
    - SQ OOTB (INT8 A/W) + FP16/BF16/FP32 LoRA
    - INT8/INT4 Weight-Only (INT8/W) + FP16/BF16/FP32 LoRA
    - Weight-Only Group-wise + FP16/BF16/FP32 LoRA
  - Added LoRA support to Qwen2, see the “Run models with LoRA” section in `examples/qwen/README.md`.
  - Added support for Phi-3-mini/small FP8 base + FP16/BF16 LoRA, see the “Run Phi-3 with LoRA” section in `examples/phi/README.md`.
  - Added support for starcoder-v2 FP8 base + FP16/BF16 LoRA, see the “Run StarCoder2 with LoRA” section in `examples/gpt/README.md`.
- Encoder-decoder models C++ runtime enhancements
  - Supported paged KV cache and inflight batching. (#800)
  - Supported tensor parallelism.
- Supported INT8 quantization with the embedding layer excluded.
- Updated the default model for Whisper to `distil-whisper/distil-large-v3`, thanks to the contribution from @IbrahimAmin1 in #1337.
- Supported automatic HuggingFace model download for the Python high-level API.
- Supported explicit draft tokens for in-flight batching.
- Supported local custom calibration datasets, thanks to the contribution from @DreamGenX in #1762.
- Added batched logits post processor.
- Added Hopper qgmma kernel to XQA JIT codepath.
- Supported tensor parallelism and expert parallelism enabled together for MoE.
- Supported pipeline parallelism cases where the number of layers is not divisible by the PP size.
- Added `numQueuedRequests` to the iteration stats log of the executor API.
- Added `iterLatencyMilliSec` to the iteration stats log of the executor API.
- Added a HuggingFace model zoo from the community, thanks to the contribution from @matichon-vultureprime in #1674.
API Changes
- [BREAKING CHANGE] `trtllm-build` command
  - Migrated Whisper to the unified workflow (`trtllm-build` command), see documents: examples/whisper/README.md.
  - `max_batch_size` in the `trtllm-build` command is switched to 256 by default.
  - `max_num_tokens` in the `trtllm-build` command is switched to 8192 by default.
  - Deprecated `max_output_len` and added `max_seq_len`.
  - Removed the unnecessary `--weight_only_precision` argument from the `trtllm-build` command.
  - Removed the `attention_qk_half_accumulation` argument from the `trtllm-build` command.
  - Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command.
  - Removed the `strongly_typed` argument from the `trtllm-build` command.
  - The default value of `max_seq_len` is now read from the HuggingFace model config.
- C++ runtime
  - [BREAKING CHANGE] Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
  - [BREAKING CHANGE] Refactored the `GptManager` API
    - Moved `maxBeamWidth` into `TrtGptModelOptionalParams`.
    - Moved `schedulerConfig` into `TrtGptModelOptionalParams`.
  - Added some more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context`.
- [BREAKING CHANGE] Python high-level API
  - Removed the `ModelConfig` class; all of its options are moved to the `LLM` class.
  - Refactored the `LLM` class, please refer to `examples/high-level-api/README.md`.
    - Moved the most commonly used options into the explicit argument list, and hid the expert options in the kwargs.
    - Exposed `model` to accept either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
    - Supported downloading models from the HuggingFace model hub; currently only Llama variants are supported.
    - Supported a build cache to reuse built TensorRT-LLM engines by setting the environment variable `TLLM_HLAPI_BUILD_CACHE=1` or passing `enable_build_cache=True` to the `LLM` class.
    - Exposed low-level options including `BuildConfig`, `SchedulerConfig` and so on in the kwargs, so you can configure details of the build and runtime phases.
  - Refactored the `LLM.generate()` and `LLM.generate_async()` APIs (see the sketch after this list).
    - Removed `SamplingConfig`.
    - Added `SamplingParams` with more extensive parameters, see `tensorrt_llm/hlapi/utils.py`.
      - The new `SamplingParams` contains and manages fields from the Python bindings of `SamplingConfig`, `OutputConfig`, and so on.
    - Refactored the `LLM.generate()` output as `RequestOutput`, see `tensorrt_llm/hlapi/llm.py`.
  - Updated the `apps` examples, notably by rewriting both `chat.py` and `fastapi_server.py` using the `LLM` APIs; please refer to `examples/apps/README.md` for details.
    - Updated `chat.py` to support multi-turn conversation, allowing users to chat with a model in the terminal.
    - Fixed `fastapi_server.py` and eliminated the need for `mpirun` in multi-GPU scenarios.
- [BREAKING CHANGE] Speculative decoding configurations unification
  - Introduced `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
  - Introduced `SpeculativeDecodingModule.h`, a base class for speculative decoding techniques.
  - Removed `decodingMode.h`.
- `gptManagerBenchmark`
  - [BREAKING CHANGE] `api` in the `gptManagerBenchmark` command is `executor` by default now.
  - Added a runtime `max_batch_size`.
  - Added a runtime `max_num_tokens`.
- [BREAKING CHANGE] Added a `bias` argument to the `LayerNorm` module, and supported non-bias layer normalization.
- [BREAKING CHANGE] Removed the `GptSession` Python bindings.
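The refactored `LLM.generate()` path is easiest to picture with a short example. The sketch below is hedged: the `tensorrt_llm.hlapi` module path, `TLLM_HLAPI_BUILD_CACHE`, and the `SamplingParams`/`RequestOutput` names come from these notes, while the model path and the individual parameter names are assumptions.

```python
# Hedged sketch of the 0.11 high-level API: SamplingParams in, RequestOutput out,
# with the optional engine build cache enabled. Paths and field names are assumptions.
import os

os.environ["TLLM_HLAPI_BUILD_CACHE"] = "1"  # or pass enable_build_cache=True to LLM

from tensorrt_llm.hlapi import LLM, SamplingParams

llm = LLM(model="/path/to/hf_model_or_trtllm_engine")  # hypothetical location
params = SamplingParams(temperature=0.8, top_p=0.95)   # assumed field names

for request_output in llm.generate(["Hello, my name is"], params):
    # RequestOutput holds one entry per returned sequence.
    print(request_output.outputs[0].text)
```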
Model Updates
- Supported Jais, see `examples/jais/README.md`.
- Supported DiT, see `examples/dit/README.md`.
- Supported VILA 1.5.
- Supported Video NeVA, see the "Video NeVA" section in `examples/multimodal/README.md`.
- Supported Grok-1, see `examples/grok/README.md`.
- Supported Qwen1.5-110B with FP8 PTQ.
- Supported Phi-3 small model with block sparse attention.
- Supported InternLM2 7B/20B, thanks to the contribution from @RunningLeon in #1392.
- Supported Phi-3-medium models, see `examples/phi/README.md`.
- Supported Qwen1.5 MoE A2.7B.
- Supported Phi-3 vision multimodal.
Fixed Issues
- Fixed broken outputs for cases when the batch size is larger than 1. (#1539)
- Fixed the `top_k` type in `executor.py`, thanks to the contribution from @vonjackustc in #1329.
- Fixed the stop and bad word list pointer offsets in the Python runtime, thanks to the contribution from @fjosw in #1486.
- Fixed some typos for the Whisper model, thanks to the contribution from @Pzzzzz5142 in #1328.
- Fixed export failure with CUDA driver < 526 and pynvml >= 11.5.0, thanks to the contribution from @CoderHam in #1537.
- Fixed an issue in NMT weight conversion, thanks to the contribution from @Pzzzzz5142 in #1660.
- Fixed LLaMA SmoothQuant conversion, thanks to the contribution from @lopuhin in #1650.
- Fixed a `qkv_bias` shape issue for Qwen1.5-32B (#1589), thanks to the contribution from @Tlntin in #1637.
- Fixed the error of Ada traits for `fpA_intB`, thanks to the contribution from @JamesTheZ in #1583.
- Updated `examples/qwenvl/requirements.txt`, thanks to the contribution from @ngoanpv in #1248.
- Fixed rsLoRA scaling in `lora_manager`, thanks to the contribution from @TheCodeWrangler in #1669.
- Fixed a Qwen1.5 checkpoint conversion failure. (#1675)
- Fixed Medusa safetensors and AWQ conversion, thanks to the contribution from @Tushar-ml in #1535.
- Fixed a `convert_hf_mpt_legacy` call failure when the function is called in other than global scope, thanks to the contribution from @bloodeagle40234 in #1534.
- Fixed `use_fp8_context_fmha` broken outputs. (#1539)
- Fixed pre-norm weight conversion for NMT models, thanks to the contribution from @Pzzzzz5142 in #1723.
- Fixed a random seed initialization issue, thanks to the contribution from @pathorn in #1742.
- Fixed stop words and bad words in the Python bindings. (#1642)
- Fixed an issue when converting the checkpoint for Mistral 7B v0.3, thanks to the contribution from @Ace-RR in #1732.
- Fixed broken in-flight batching for FP8 Llama and Mixtral, thanks to the contribution from @bprus in #1738.
- Fixed a failure when `quantize.py` exports data to config.json, thanks to the contribution from @janpetrov in #1676.
- Raised an error when autopp detects an unsupported quant plugin. (#1626)
- Fixed the issue that `shared_embedding_table` is not being set when loading Gemma (#1799), thanks to the contribution from @mfuntowicz.
- Fixed stop and bad words list contiguity for `ModelRunner` (#1815), thanks to the contribution from @Marks101.
- Fixed a missing comment for `FAST_BUILD`, thanks to the support from @lkm2835 in #1851.
- Fixed the issue that Top-P sampling occasionally produces invalid tokens. (#1590)
- Fixed #1424.
- Fixed #1529.
- Fixed `benchmarks/cpp/README.md` for #1562 and #1552.
- Fixed dead links, thanks to the help from @DefTruth, @buvnswrn and @sunjiabin17 in triton-inference-server/tensorrtllm_backend#478, triton-inference-server/tensorrtllm_backend#482 and triton-inference-server/tensorrtllm_backend#449.
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.05-py3`.
- Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/...
TensorRT-LLM 0.10.0 Release
Hi,
We are very pleased to announce the 0.10.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
Key Features and Enhancements
- The Python high level API
  - Added embedding parallel, embedding sharing, and fused MLP support.
  - Enabled the usage of the `executor` API.
- Added a weight-stripping feature with a new `trtllm-refit` command. For more information, refer to `examples/sample_weight_stripping/README.md`.
- Added a weight-streaming feature. For more information, refer to `docs/source/advanced/weight-streaming.md`.
- Enhanced the multiple profiles feature; the `--multiple_profiles` argument in the `trtllm-build` command now builds more optimization profiles for better performance.
- Added FP8 quantization support for Mixtral.
- Added support for pipeline parallelism for GPT.
- Optimized the `applyBiasRopeUpdateKVCache` kernel by avoiding re-computation.
- Reduced overheads between `enqueue` calls of TensorRT engines.
- Added support for paged KV cache for enc-dec models. The support is limited to beam width 1.
- Added W4A(fp)8 CUTLASS kernels for the NVIDIA Ada Lovelace architecture.
- Added debug options (`--visualize_network` and `--dry_run`) to the `trtllm-build` command to visualize the TensorRT network before engine build.
- Integrated the new NVIDIA Hopper XQA kernels for the LLaMA 2 70B model.
- Improved the performance of pipeline parallelism when enabling in-flight batching.
- Supported quantization for Nemotron models.
- Added LoRA support for Mixtral and Qwen.
- Added in-flight batching support for ChatGLM models.
- Added support to `ModelRunnerCpp` so that it runs with the `executor` API for IFB-compatible models (see the sketch after this list).
- Enhanced the custom `AllReduce` by adding a heuristic; fall back to the native NCCL kernel when hardware requirements are not satisfied, to get the best performance.
- Optimized the performance of the checkpoint conversion process for LLaMA.
- Benchmark
  - [BREAKING CHANGE] Moved the request rate generation arguments and logic from the prepare dataset script to `gptManagerBenchmark`.
  - Enabled streaming and supported `Time To the First Token (TTFT)` latency and `Inter-Token Latency (ITL)` metrics for `gptManagerBenchmark`.
  - Added the `--max_attention_window` option to `gptManagerBenchmark`.
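For the `ModelRunnerCpp` + `executor` path, a hedged sketch of running an already-built, in-flight-batching-compatible engine is shown below; the engine location, token ids, and keyword names are assumptions for illustration.

```python
# Hedged sketch: run a built engine through ModelRunnerCpp (which now uses the executor API
# for IFB-compatible engines). Paths, token ids, and keyword names are assumptions.
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(engine_dir="/path/to/engine_dir")  # hypothetical engine location

batch_input_ids = [torch.tensor([1, 450, 4996, 2], dtype=torch.int32)]  # pre-tokenized prompt
output_ids = runner.generate(batch_input_ids, max_new_tokens=16, end_id=2, pad_id=2)
print(output_ids[0])  # token ids for batch entry 0 (decode with your tokenizer)
```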
API Changes
- [BREAKING CHANGE] Set the default `tokens_per_block` argument of the `trtllm-build` command to 64 for better performance.
- [BREAKING CHANGE] Migrated enc-dec models to the unified workflow.
- [BREAKING CHANGE] Renamed `GptModelConfig` to `ModelConfig`.
- [BREAKING CHANGE] Added speculative decoding mode to the builder API.
- [BREAKING CHANGE] Refactored the scheduling configurations (see the sketch after this list).
  - Unified the `SchedulerPolicy` with the same name in `batch_scheduler` and `executor`, and renamed it to `CapacitySchedulerPolicy`.
  - Expanded the existing configuration scheduling strategy from `SchedulerPolicy` to `SchedulerConfig` to enhance extensibility. The latter also introduces a chunk-based configuration called `ContextChunkingPolicy`.
- [BREAKING CHANGE] The input prompt was removed from the generation output in the `generate()` and `generate_async()` APIs. For example, when given a prompt of `A B`, the original generation result could be `<s>A B C D E` where only `C D E` is the actual output; now the result is `C D E`.
- [BREAKING CHANGE] Switched the default `add_special_token` in the TensorRT-LLM backend to `True`.
- Deprecated `GptSession` and `TrtGptModelV1`.
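A hypothetical sketch of the new scheduling configuration through the executor Python bindings follows. Only `SchedulerConfig`, `CapacitySchedulerPolicy`, and `ContextChunkingPolicy` are named in these notes; the module path, enum members, and constructor keywords are assumptions.

```python
# Hypothetical sketch: select a capacity scheduler policy via SchedulerConfig.
# Module path, enum member, and keyword names are assumptions, not taken from the notes.
from tensorrt_llm.bindings import executor as trtllm

scheduler_config = trtllm.SchedulerConfig(
    capacity_scheduler_policy=trtllm.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
)
executor_config = trtllm.ExecutorConfig(scheduler_config=scheduler_config)
# executor_config would then be passed when constructing an Executor instance.
```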
Model Updates
- Support DBRX
- Support Qwen2
- Support CogVLM
- Support ByT5
- Support LLaMA 3
- Support Arctic (w/ FP8)
- Support Fuyu
- Support Persimmon
- Support Deplot
- Support Phi-3-Mini with long Rope
- Support Neva
- Support Kosmos-2
- Support RecurrentGemma
Fixed Issues
- Fixed some unexpected behaviors in beam search and early stopping, so that the outputs are more accurate.
- Fixed a segmentation fault with pipeline parallelism and `gather_all_token_logits`. (#1284)
- Removed an unnecessary check in XQA to fix Code Llama 70B Triton crashes. (#1256)
- Fixed an unsupported ScalarType issue for BF16 LoRA. (triton-inference-server/tensorrtllm_backend#403)
- Eliminated the load and save of prompt table in multimodal. (#1436)
- Fixed an error when converting the models weights of Qwen 72B INT4-GPTQ. (#1344)
- Fixed early stopping and failures on in-flight batching cases of Medusa. (#1449)
- Added support for more NVLink versions for auto parallelism. (#1467)
- Fixed the assert failure caused by default values of sampling config. (#1447)
- Fixed a requirement specification on Windows for nvidia-cudnn-cu12. (#1446)
- Fixed an MMHA relative position calculation error in `gpt_attention_plugin` for enc-dec models. (#1343)
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.03-py3`.
- Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.03-py3`.
. - The dependent TensorRT version is updated to 10.0.1.
- The dependent CUDA version is updated to 12.4.0.
- The dependent PyTorch version is updated to 2.2.2.
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the `main` branch regularly with new features, bug fixes and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.9.0 Release
Hi,
We are very pleased to announce the 0.9.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Model Support
- Support distil-whisper, thanks to the contribution from @Bhuvanesh09 in PR #1061
- Support HuggingFace StarCoder2
- Support VILA
- Support Smaug-72B-v0.1
- Migrate BLIP-2 examples to `examples/multimodal`
- Features
- [BREAKING CHANGE] TopP sampling optimization with deterministic AIR TopP algorithm is enabled by default
- [BREAKING CHANGE] Support embedding sharing for Gemma
- Add support to context chunking to work with KV cache reuse
- Enable different rewind tokens per sequence for Medusa
- BART LoRA support (limited to the Python runtime)
- Enable multi-LoRA for BART LoRA
- Support `early_stopping=False` in beam search for C++ Runtime
- Add logits post processor to the batch manager (see docs/source/batch_manager.md#logits-post-processor-optional)
- Support import and convert HuggingFace Gemma checkpoints, thanks for the contribution from @mfuntowicz in #1147
- Support loading Gemma from HuggingFace
- Support auto parallelism planner for high-level API and unified builder workflow
- Support running `GptSession` without OpenMPI #1220
- Medusa IFB support
- [Experimental] Support FP8 FMHA, note that the performance is not optimal, and we will keep optimizing it
- More head sizes support for LLaMA-like models
- Ampere (sm80, sm86), Ada (sm89), and Hopper (sm90) all support head sizes [32, 40, 64, 80, 96, 104, 128, 160, 256] now.
- OOTB functionality support
- T5
- Mixtral 8x7B
- API
- C++ `executor` API
  - Add Python bindings, see documentation and examples in `examples/bindings`
  - Add advanced and multi-GPU examples for Python binding of `executor` C++ API, see `examples/bindings/README.md`
  - Add documents for C++ `executor` API, see `docs/source/executor.md`
- High-level API (refer to `examples/high-level-api/README.md` for guidance)
  - [BREAKING CHANGE] Reuse the `QuantConfig` used in the `trtllm-build` tool, support broader quantization features
  - Support in the `LLM()` API to accept engines built by the `trtllm-build` command
  - Add support for TensorRT-LLM checkpoint as model input
  - Refine `SamplingConfig` used in the `LLM.generate` or `LLM.generate_async` APIs, with support for beam search, a variety of penalties, and more features
  - Add support for the StreamingLLM feature, enable it by setting `LLM(streaming_llm=...)`
  - Migrate Mixtral to the high-level API and unified builder workflow
- [BREAKING CHANGE] Refactored Qwen model to the unified build workflow, see `examples/qwen/README.md` for the latest commands
- [BREAKING CHANGE] Move LLaMA convert checkpoint script from the examples directory into the core library
- [BREAKING CHANGE] Refactor GPT with the unified building workflow, see `examples/gpt/README.md` for the latest commands
- [BREAKING CHANGE] Removed all the LoRA-related flags from the convert_checkpoint.py script and the checkpoint content and moved them to the `trtllm-build` command, to generalize the feature better to more models
- [BREAKING CHANGE] Removed the use_prompt_tuning flag and options from the convert_checkpoint.py script and the checkpoint content, to generalize the feature better to more models. Use `trtllm-build --max_prompt_embedding_table_size` instead.
- [BREAKING CHANGE] Changed the `trtllm-build --world_size` flag to the `--auto_parallel` flag; the option is used for the auto parallel planner only.
- [BREAKING CHANGE] `AsyncLLMEngine` is removed; the `tensorrt_llm.GenerationExecutor` class is refactored to work both when launched explicitly with `mpirun` at the application level and when given an MPI communicator created by `mpi4py`
- [BREAKING CHANGE] `examples/server` are removed, see `examples/app` instead.
- [BREAKING CHANGE] Remove LoRA-related parameters from convert checkpoint scripts
- [BREAKING CHANGE] Simplify Qwen convert checkpoint script
- [BREAKING CHANGE] Remove `model` parameter from `gptManagerBenchmark` and `gptSessionBenchmark`
- Bug fixes
- Fix a weight-only quant bug for Whisper to make sure that the `encoder_input_len_range` is not 0, thanks to the contribution from @Eddie-Wang1120 in #992
- Fix the issue that log probabilities in Python runtime are not returned #983
- Multi-GPU fixes for multimodal examples #1003
- Fix wrong `end_id` issue for Qwen #987
- Fix a non-stopping generation issue #1118 #1123
- Fix wrong link in examples/mixtral/README.md #1181
- Fix LLaMA2-7B bad results when int8 kv cache and per-channel int8 weight only are enabled #967
- Fix wrong `head_size` when importing Gemma model from HuggingFace Hub, thanks for the contribution from @mfuntowicz in #1148
- Fix ChatGLM2-6B building failure on INT8 #1239
- Fix wrong relative path in Baichuan documentation #1242
- Fix wrong `SamplingConfig` tensors in `ModelRunnerCpp` #1183
- Fix error when converting SmoothQuant LLaMA #1267
- Fix the issue that `examples/run.py` only loads one line from `--input_file`
- Fix the issue that `ModelRunnerCpp` does not transfer `SamplingConfig` tensor fields correctly #1183
- Benchmark
- Add emulated static batching in `gptManagerBenchmark`
- Support arbitrary dataset from HuggingFace for C++ benchmarks, see the “Prepare dataset” section in `benchmarks/cpp/README.md`
- Add percentile latency report to `gptManagerBenchmark`
- Performance
- Infra
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.02-py3`
- Base Docker image for TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.02-py3`
- The dependent TensorRT version is updated to 9.3
- The dependent PyTorch version is updated to 2.2
- The dependent CUDA version is updated to 12.3.2 (a.k.a. 12.3 Update 2)
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.8.0 Release
Hi,
We are very pleased to announce the 0.8.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Model Support
- Phi-1.5/2.0
- Mamba support (see examples/mamba/README.md)
- The support is limited to beam width = 1 and single-node single-GPU
- Nougat support (see examples/multimodal/README.md#nougat)
- Qwen-VL support (see examples/qwenvl/README.md)
- RoBERTa support, thanks to the contribution from @erenup
- Skywork model support
- Add example for multimodal models (BLIP with OPT or T5, LlaVA)
- Features
- Chunked context support (see docs/source/gpt_attention.md#chunked-context)
- LoRA support for C++ runtime (see docs/source/lora.md)
- Medusa decoding support (see examples/medusa/README.md)
- The support is limited to the Python runtime for Ampere or newer GPUs with fp16 and bf16 accuracy, and the `temperature` parameter of the sampling configuration should be 0
- StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
- Support for batch manager to return logits from context and/or generation phases
- Include support in the Triton backend
- Support AWQ and GPTQ for QWEN
- Support ReduceScatter plugin
- Support for combining `repetition_penalty` and `presence_penalty` #274
- Support for `frequency_penalty` #275
- OOTB functionality support:
- Baichuan
- InternLM
- Qwen
- BART
- LLaMA
- Support enabling INT4-AWQ along with FP8 KV Cache
- Support BF16 for weight-only plugin
- P-tuning support
- INT4-AWQ and INT4-GPTQ support
- Decoder iteration-level profiling improvements
- Add
masked_select
andcumsum
function for modeling - Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
- Add Weight-Only Support To Whisper #794, thanks to the contribution from @Eddie-Wang1120
- Support FP16 fMHA on NVIDIA V100 GPU
- API
- Add a set of High-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
- [BREAKING CHANGES] Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/new_workflow.md)
- [BREAKING CHANGES] Deprecate `LayerNorm` and `RMSNorm` plugins and removed corresponding build parameters
- [BREAKING CHANGES] Remove optional parameter `maxNumSequences` for GPT manager
- Bug fixes
- Fix the issue of the first token being abnormal when `--gather_all_token_logits` is enabled #639
- Fix LLaMA with LoRA enabled build failure #673
- Fix InternLM SmoothQuant build failure #705
- Fix Bloom int8_kv_cache functionality #741
- Fix crash in `gptManagerBenchmark` #649
- Fix Blip2 build error #695
- Add pickle support for `InferenceRequest` #701
- Fix Mixtral-8x7b build failure with custom_all_reduce #825
- Fix INT8 GEMM shape #935
- Minor bug fixes
- Performance
- [BREAKING CHANGES] Increase default `freeGpuMemoryFraction` parameter from 0.85 to 0.9 for higher throughput
- [BREAKING CHANGES] Disable `enable_trt_overlap` argument for GPT manager by default
- Performance optimization of beam search kernel
- Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels
- Custom AllReduce plugins performance optimization
- Top-P sampling performance optimization
- LoRA performance optimization
- Custom allreduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
- Integrate XQA kernels for GPT-J (beamWidth=4)
- Documentation
- Batch manager arguments documentation updates
- Add documentation for best practices for tuning the performance of TensorRT-LLM (See docs/source/perf_best_practices.md)
- Add documentation for Falcon AWQ support (See examples/falcon/README.md)
- Update to the `docs/source/new_workflow.md` documentation
- Update AWQ INT4 weight only quantization documentation for GPT-J
- Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
- Refine TensorRT-LLM backend README structure #133
- Typo fix #739
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.7.1 Release
Hi,
We are very pleased to announce the 0.7.1 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Models
- BART and mBART support in encoder-decoder models
- FairSeq Neural Machine Translation (NMT) family
- Mixtral-8x7B model
- Support weight loading for HuggingFace Mixtral model
- OpenAI Whisper
- Mixture of Experts support
- MPT - Int4 AWQ / SmoothQuant support
- Baichuan FP8 quantization support
- Features
- [Preview] Speculative decoding
- Add Python binding for `GptManager`
- Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
- System prompt caching
- Enable split-k for weight-only cutlass kernels
- FP8 KV cache support for XQA kernel
- New Python builder API and `trtllm-build` command (already applied to blip2 and OPT)
- Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
- fMHA support for chunked attention and paged kv cache
- Bug fixes
- Performance
- MMHA optimization for MQA and GQA
- LoRA optimization: cutlass grouped gemm
- Optimize Hopper warp specialized kernels
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
- Documentation
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
TensorRT-LLM 0.6.1 Release
Hi,
We are very pleased to announce the 0.6.1 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Models
- ChatGLM3
- InternLM (contributed by @wangruohui)
- Mistral 7B (developed in collaboration with Mistral.AI)
- MQA/GQA support to MPT (and GPT) models (contributed by @bheilbrun)
- Qwen (contributed by @Tlntin and @zhaohb)
- Replit Code V-1.5 3B (contributed by @bheilbrun)
- T5, mT5, Flan-T5 (Python runtime only, contributed by @mlmonk and @nqbao11)
- Features
- Add runtime statistics related to active requests and KV cache utilization from the batch manager (see the batch manager documentation)
- Add `sequence_length` tensor to support proper lengths in beam-search (when beam-width > 1 - see tensorrt_llm/batch_manager/GptManager.h)
- BF16 support for encoder-decoder models (Python runtime - see examples/enc_dec)
- Improvements to memory utilization (CPU and GPU - including memory leaks)
- Improved error reporting and memory consumption
- Improved support for stop and bad words
- INT8 SmoothQuant and INT8 KV Cache support for the Baichuan models (see examples/baichuan)
- INT4 AWQ Tensor Parallelism support and INT8 KV cache + AWQ/weight-only support for the GPT-J model (see examples/gptj)
- INT4 AWQ support for the Falcon models (see examples/falcon)
- LoRA support (functional preview only - limited to the Python runtime, only QKV support and not optimized in terms of runtime performance) for the GPT model (see the Run LoRA with the Nemo checkpoint in the GPT example)
- Multi-GPU support for encoder-decoder models (Python runtime - see examples/enc_dec)
- New heuristic for launching the Multi-block Masked MHA kernel (similar to FlashDecoding - see decoderMaskedMultiheadAttentionLaunch.h)
- Prompt-Tuning support for GPT and LLaMA models (see the Prompt-tuning Section in the GPT example)
- Performance optimizations in various CUDA kernels
- Possibility to exclude input tokens from the output (see `excludeInputInOutput` in `GptManager`)
- Python binding for the C++ runtime (GptSession - see `pybind`)
- Support for different micro batch sizes for context and generation phases with pipeline parallelism (see `GptSession::Config::ctxMicroBatchSize` and `GptSession::Config::genMicroBatchSize` in tensorrt_llm/runtime/gptSession.h)
- Support for "remove input padding" for encoder-decoder models (see examples/enc_dec)
- Support for context and generation logits (see `mComputeContextLogits` and `mComputeGenerationLogits` in tensorrt_llm/runtime/gptModelConfig.h)
- Support for `logProbs` and `cumLogProbs` (see `"output_log_probs"` and `"cum_log_probs"` in `GptManager`)
- Update to CUTLASS 3.x
- Bug fixes
- Fix for ChatGLM2 #93 and #138
- Fix tensor names error "RuntimeError: Tensor names (`host_max_kv_cache_length`) in engine are not the same as expected in the main branch" #369
- Fix weights split issue in BLOOM when `world_size = 2` ("array split does not result in an equal division") #374
- Fix SmoothQuant multi-GPU failure when tensor parallelism is 2 #267
- Fix a crash in GenerationSession if stream keyword argument is not None #202
- Fix a typo when calling PyNVML API [BUG] code bug #410
- Fix bugs related to the improper management of the `end_id` for various models [C++ and Python]
- Fix memory leaks [C++ code and Python models]
- Fix the std::alloc error when running the gptManagerBenchmark -- issue gptManagerBenchmark std::bad_alloc error #66
- Fix a bug in pipeline parallelism when beam-width > 1
- Fix a bug with Llama GPTQ due to improper support of GQA
- Fix issue #88
- Fix an issue with the Huggingface Transformers version #16
- Fix link jump in windows readme.md #30 - by @yuanlehome
- Fix typo in batchScheduler.h #56 - by @eltociear
- Fix typo #58 - by @RichardScottOZ
- Fix Multi-block MMHA: Difference between `max_batch_size` in the engine builder and `max_num_sequences` in TrtGptModelOptionalParams? #65
- Fix the log message to be more accurate on KV cache #224
- Fix Windows release wheel installation: Failed to install the release wheel for Windows using pip #261
- Fix missing torch dependencies: [BUG] The batch_manage.a choice error in --cpp-only when torch's cxx_abi version is different with gcc #151
- Fix linking error during compiling google-test & benchmarks #277
- Fix logits dtype for Baichuan and ChatGLM: segmentation fault caused by the lack of bfloat16 #335
- Minor bug fixes
Currently, there are two key branches in the project:
- The rel branch contains what we'd call the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch contains what we'd call the dev branch. It is more experimental.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently. The exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
The first release of TensorRT-LLM
revise the homepage (#14) Co-authored-by: Shi Xiaowei <xiaoweis@nvidia.com>