Commit

Update TensorRT-LLM Release branch (#1445)
* Update TensorRT-LLM

---------

Co-authored-by: Bhuvanesh Sridharan <bhuvan.sridharan@gmail.com>
Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
Co-authored-by: Eddie-Wang1120 <wangjinheng1120@163.com>
Co-authored-by: meghagarwal <16129366+megha95@users.noreply.github.com>
5 people authored Apr 12, 2024
1 parent 37aee91 commit 250d9c2
Showing 1,038 changed files with 3,439,884 additions and 389,685 deletions.
1 change: 1 addition & 0 deletions .clang-format
@@ -59,6 +59,7 @@ PenaltyBreakString: 1000
PenaltyExcessCharacter: 1000000
PenaltyReturnTypeOnItsOwnLine: 60
PointerAlignment: Left
QualifierAlignment: Right
ReflowComments: true
SeparateDefinitionBlocks: Always
SortIncludes: CaseSensitive
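For context, the added `QualifierAlignment: Right` option asks clang-format to place `const`/`volatile` qualifiers to the right of the type they modify (so-called "East const" placement). Below is a minimal sketch of the effect on a hypothetical header, assuming a clang-format release new enough to support the option; the `Tensor` type and its members are invented purely for illustration:

```cpp
// Hypothetical header, used only to show what `QualifierAlignment: Right`
// ("East const") does to qualifier placement.
#include <cstddef>
#include <string>

struct Tensor
{
    // Written as:  const std::string& name() const;
    // Formatted:   the `const` qualifier moves to the right of the type.
    std::string const& name() const;

    // Written as:  void fill(const float* values, const std::size_t count);
    // Formatted:
    void fill(float const* values, std::size_t const count);

private:
    std::string name_;
};
```

The option only rewrites qualifier placement; it does not change semantics, so code compiles identically after the reformat.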
10 changes: 10 additions & 0 deletions .gitignore
@@ -17,6 +17,16 @@ venv/
.local/
.hypothesis/
.idea/
dump*/
.trt-internal
*.dot
*.prof
*.log
*.pkl
*.hdf5
*.lock
config.json
/*.svg
cpp/cmake-build-*
cpp/.ccache/
tensorrt_llm/libs
2 changes: 1 addition & 1 deletion 3rdparty/cutlass
Submodule cutlass updated 1833 files
77 changes: 76 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,80 @@
# Change Log

## Versions 0.8.0

* Model Support
- Phi-1.5/2.0
- Mamba support (see examples/mamba/README.md)
- The support is limited to beam width = 1 and single-node single-GPU
- Nougat support (see examples/multimodal/README.md#nougat)
- Qwen-VL support (see examples/qwenvl/README.md)
- RoBERTa support, thanks to the contribution from @erenup
- Skywork model support
- Add an example for multimodal models (BLIP with OPT or T5, LLaVA)
* Features
- Chunked context support (see docs/source/gpt_attention.md#chunked-context)
- LoRA support for C++ runtime (see docs/source/lora.md)
- Medusa decoding support (see examples/medusa/README.md)
- The support is limited to the Python runtime on Ampere or newer GPUs with fp16 and bf16 accuracy, and the `temperature` parameter of the sampling configuration should be 0
- StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
- Support for batch manager to return logits from context and/or generation phases
- Include support in the Triton backend
- Support AWQ and GPTQ for Qwen
- Support ReduceScatter plugin
- Support for combining `repetition_penalty` and `presence_penalty` #274
- Support for `frequency_penalty` #275
- OOTB functionality support:
- Baichuan
- InternLM
- Qwen
- BART
- LLaMA
- Support enabling INT4-AWQ along with FP8 KV Cache
- Support BF16 for weight-only plugin
- Baichuan
- P-tuning support
- INT4-AWQ and INT4-GPTQ support
- Decoder iteration-level profiling improvements
- Add `masked_select` and `cumsum` functions for modeling
- Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
- Add weight-only support to Whisper #794, thanks to the contribution from @Eddie-Wang1120
- Support FP16 fMHA on NVIDIA V100 GPU
* API
- Add a set of High-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
- **[BREAKING CHANGES]** Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/checkpoint.md)
- **[BREAKING CHANGES]** Deprecate the `LayerNorm` and `RMSNorm` plugins and remove the corresponding build parameters
- **[BREAKING CHANGES]** Remove optional parameter `maxNumSequences` for GPT manager
* Bug fixes
- Fix the issue of the first token being abnormal when `--gather_all_token_logits` is enabled #639
- Fix LLaMA with LoRA enabled build failure #673
- Fix InternLM SmoothQuant build failure #705
- Fix Bloom int8_kv_cache functionality #741
- Fix crash in `gptManagerBenchmark` #649
- Fix Blip2 build error #695
- Add pickle support for `InferenceRequest` #701
- Fix Mixtral-8x7b build failure with custom_all_reduce #825
- Fix INT8 GEMM shape #935
- Minor bug fixes
* Performance
- **[BREAKING CHANGES]** Increase default `freeGpuMemoryFraction` parameter from 0.85 to 0.9 for higher throughput
- **[BREAKING CHANGES]** Disable `enable_trt_overlap` argument for GPT manager by default
- Performance optimization of beam search kernel
- Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels
- Custom AllReduce plugins performance optimization
- Top-P sampling performance optimization
- LoRA performance optimization
- Custom allreduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
- Integrate XQA kernels for GPT-J (beamWidth=4)
* Documentation
- Batch manager arguments documentation updates
- Add documentation for best practices for tuning the performance of TensorRT-LLM (See docs/source/perf_best_practices.md)
- Add documentation for Falcon AWQ support (See examples/falcon/README.md)
- Update to the `docs/source/checkpoint.md` documentation
- Update AWQ INT4 weight only quantization documentation for GPT-J
- Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
- Refine TensorRT-LLM backend README structure #133
- Typo fix #739

## Versions 0.7.0 / 0.7.1

* Models
@@ -34,7 +109,7 @@
- Optimize AllReduce for parallel attention on Falcon and GPT-J
- Enable split-k for weight-only cutlass kernel when SM>=75
* Documentation
- Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
- Add [documentation for convert/build workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/checkpoint.md)

## Versions 0.6.0 / 0.6.1

(Diffs for the remaining changed files are not shown.)
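As a side note on the 0.8.0 performance items above: the custom allreduce entry credits a ping-pong buffer with removing an extra synchronization cost. The sketch below shows the general ping-pong (double-buffering) pattern only; it is not TensorRT-LLM's implementation, and the `PingPongBuffer` class and its members are invented for illustration:

```cpp
// Generic ping-pong (double) buffering: the producer writes into one buffer
// while the consumer is still reading the other, so neither side has to wait
// for a single shared buffer to be released.
#include <array>
#include <cstddef>
#include <vector>

class PingPongBuffer
{
public:
    explicit PingPongBuffer(std::size_t size)
        : buffers_{std::vector<float>(size), std::vector<float>(size)}
    {
    }

    // Buffer the producer fills during the current iteration.
    std::vector<float>& writeBuffer() { return buffers_[index_]; }

    // Buffer holding the previous iteration's data for the consumer.
    std::vector<float>& readBuffer() { return buffers_[index_ ^ 1U]; }

    // Flip the roles once the consumer has finished with the read buffer.
    void swap() { index_ ^= 1U; }

private:
    std::array<std::vector<float>, 2> buffers_;
    std::size_t index_ = 0;
};
```

Each iteration, the producer and the consumer touch different buffers, and a single `swap()` at the iteration boundary replaces the per-transfer synchronization that one shared buffer would require.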