diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1c92f9f9..66139b50 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,134 @@
# TensorRT OSS Release Changelog
+## 10.0.0 EA - 2024-04-02
+
+Key Features and Updates:
+
+ - Samples changes
+ - Added a [sample](samples/python/sample_weight_stripping) showcasing weight-stripped engines.
+ - Added a [sample](samples/python/python_plugin/circ_pad_plugin_multi_tactic.py) demonstrating the use of custom tactics with IPluginV3.
+ - Added a [sample](samples/sampleNonZeroPlugin) to showcase plugins with data-dependent output shapes, using IPluginV3.
+ - Parser changes
+ - Added a new class `IParserRefitter` that can be used to refit a TensorRT engine with the weights of an ONNX model (see the usage sketch below).
+ - `kNATIVE_INSTANCENORM` is now set to ON by default.
+ - Added support for `IPluginV3` interfaces from TensorRT.
+ - Added support for `INT4` quantization.
+ - Added support for the `reduction` attribute in `ScatterElements`.
+ - Added support for `wrap` padding mode in `Pad`.
+ - Plugin changes
+ - A [new plugin](plugin/scatterElementsPlugin) has been added in compliance with [ONNX ScatterElements](https://github.com/onnx/onnx/blob/main/docs/Operators.md#ScatterElements).
+ - The TensorRT plugin library no longer has a load-time link dependency on cuBLAS or cuDNN libraries.
+ - All plugins which relied on cuBLAS/cuDNN handles passed through `IPluginV2Ext::attachToContext()` have moved to use cuBLAS/cuDNN resources initialized by the plugin library itself. This works by dynamically loading the required cuBLAS/cuDNN library. Additionally, plugins which independently initialized their cuBLAS/cuDNN resources have also moved to dynamically loading the required library. If the respective library is not discoverable through the library path(s), these plugins will not work.
+ - bertQKVToContextPlugin: Version 2 of this plugin now supports head sizes less than or equal to 32.
+ - reorgPlugin: Added a version 2 which implements IPluginV2DynamicExt.
+ - disentangledAttentionPlugin: Fixed a kernel bug.
+ - Demo changes
+ - HuggingFace demos have been removed. Users accelerating Large Language Model inference with TensorRT should use [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/).
+ - Updated tooling
+ - Polygraphy v0.49.9
+ - ONNX-GraphSurgeon v0.5.1
+ - TensorRT Engine Explorer v0.1.8
+ - Build Containers
+ - RedHat/CentOS 7.x are no longer officially supported starting with TensorRT 10.0. The corresponding container has been removed from TensorRT-OSS.
+
+## 9.3.0 GA - 2024-02-09
+
+Key Features and Updates:
+
+ - Demo changes
+ - Faster text-to-image using SDXL with INT8 quantization via AMMO
+ - Updated tooling
+ - Polygraphy v0.49.7
+
+## 9.2.0 GA - 2023-11-27
+
+Key Features and Updates:
+
+ - `trtexec` enhancement: Added `--weightless` flag to mark the engine as weightless.
+ - Parser changes
+ - Added support for the Hardmax operator.
+ - Changes to a few operator importers to ensure that TensorRT preserves the precision of operations when using strongly typed mode.
+ - Plugin changes
+ - Explicit INT8 support added to `bertQKVToContextPlugin`.
+ - Various bug fixes.
+ - Updated HuggingFace demo to use transformers v4.31.0 and PyTorch v2.1.0.
+
+
+## 9.1.0 GA - 2023-10-18
+
+Key Features and Updates:
+
+ - Updated the [trt_python_plugin](samples/python/python_plugin) sample.
+ - The Python plugin API reference is part of the official TRT Python API.
+ - Added samples demonstrating the usage of the progress monitor API; a minimal sketch of the same pattern appears below.
+ - Check [sampleProgressMonitor](samples/sampleProgressMonitor) for the C++ sample.
+ - Check [simple_progress_monitor](samples/python/simple_progress_monitor) for the Python sample.
+ - Removed dependencies on Python < 3.8 in the Python samples, as Python versions below 3.8 are no longer supported.
+ - Demo changes
+ - Added LAMBADA dataset accuracy checks in the [HuggingFace](demo/HuggingFace) demo.
+ - Enabled structured sparsity and FP8 quantized batch matrix multiplications (BMMs) in attention in the [NeMo](demo/NeMo) demo.
+ - Replaced deprecated APIs in the [BERT](demo/BERT) demo.
+ - Updated tooling
+ - Polygraphy v0.49.1
+
+
+## 9.0.1 GA - 2023-09-07
+
+Key Features and Updates:
+
+ - TensorRT plugin authoring in Python is now supported.
+ - See the [trt_python_plugin](samples/python/python_plugin) sample for reference.
+ - Updated default CUDA version to 12.2
+ - Added support for BLIP models and the Seq2Seq and Vision2Seq abstractions in the HuggingFace demo.
+ - demoDiffusion refactoring and SDXL enhancements
+ - Additional validation asserts for NV Plugins
+ - Updated tooling
+ - TensorRT Engine Explorer v0.1.7: graph rendering for TensorRT 9.0 `kgen` kernels
+ - ONNX-GraphSurgeon v0.3.29
+ - PyTorch quantization toolkit v2.2.0
+
+
+## 9.0.0 EA - 2023-08-06
+
+Key Features and Updates:
+
+ - Added the NeMo demo to demonstrate the performance benefit of using E4M3 FP8 data type with the GPT models trained with the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) and [TransformerEngine](https://github.com/NVIDIA/TransformerEngine).
+ - Demo Diffusion updates
+ - Added SDXL 1.0 txt2img pipeline
+ - Added ControlNet pipeline
+ - HuggingFace demo updates
+ - Added Flan-T5, OPT, BLOOM, BLOOMZ, GPT-Neo, GPT-NeoX, Cerebras-GPT support with accuracy check
+ - Refactored code and extracted common utils into Seq2Seq class
+ - Reduced shape-changing overhead, achieving a >30% end-to-end performance gain
+ - Added stable KV-cache, beam search and FP16 support for all models
+ - Added dynamic batch size TRT inference
+ - Added uneven-length multi-batch inference with attention_mask support
+ - Added `chat` command – interactive CLI
+ - Upgraded PyTorch and HuggingFace versions to support Hopper GPUs
+ - Updated notebooks with a much-simplified demo API.
+
+ - Added two new TensorRT samples: sampleProgressMonitor (C++) and simple_progress_reporter (Python), which demonstrate use of the Progress Monitor API during engine build.
+ - The following plugins were deprecated:
+ - ``BatchedNMS_TRT``
+ - ``BatchedNMSDynamic_TRT``
+ - ``BatchTilePlugin_TRT``
+ - ``Clip_TRT``
+ - ``CoordConvAC``
+ - ``CropAndResize``
+ - ``EfficientNMS_ONNX_TRT``
+ - ``CustomGeluPluginDynamic``
+ - ``LReLU_TRT``
+ - ``NMSDynamic_TRT``
+ - ``NMS_TRT``
+ - ``Normalize_TRT``
+ - ``Proposal``
+ - ``SingleStepLSTMPlugin``
+ - ``SpecialSlice_TRT``
+ - ``Split``
+
+ - Ubuntu 18.04 has reached end of life and is no longer supported by TensorRT starting with 9.0, and the corresponding Dockerfile(s) have been removed.
+ - Support for aarch64 builds will not be available in this release, and the corresponding Dockerfiles have been removed.
+
## [8.6.1 GA](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/#rel-8-6-1) - 2023-05-02
TensorRT OSS release corresponding to TensorRT 8.6.1.6 GA release.
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 66f4201b..5d29b78e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -22,21 +22,41 @@ include(cmake/modules/find_library_create_target.cmake)
set_ifndef(TRT_LIB_DIR ${CMAKE_BINARY_DIR})
set_ifndef(TRT_OUT_DIR ${CMAKE_BINARY_DIR})
+# Converts Windows paths
+if(CMAKE_VERSION VERSION_LESS 3.20)
+ file(TO_CMAKE_PATH "${TRT_LIB_DIR}" TRT_LIB_DIR)
+ file(TO_CMAKE_PATH "${TRT_OUT_DIR}" TRT_OUT_DIR)
+else()
+ cmake_path(SET TRT_LIB_DIR ${TRT_LIB_DIR})
+ cmake_path(SET TRT_OUT_DIR ${TRT_OUT_DIR})
+endif()
+
+# Required to export symbols to build *.libs
+if(WIN32)
+ add_compile_definitions(TENSORRT_BUILD_LIB=1)
+endif()
+
+# Set output paths
+set(RUNTIME_OUTPUT_DIRECTORY ${TRT_OUT_DIR} CACHE PATH "Output directory for runtime target files")
+set(LIBRARY_OUTPUT_DIRECTORY ${TRT_OUT_DIR} CACHE PATH "Output directory for library target files")
+set(ARCHIVE_OUTPUT_DIRECTORY ${TRT_OUT_DIR} CACHE PATH "Output directory for archive target files")
+
+if(WIN32)
+ set(STATIC_LIB_EXT "lib")
+else()
+ set(STATIC_LIB_EXT "a")
+endif()
+
file(STRINGS "${CMAKE_CURRENT_SOURCE_DIR}/include/NvInferVersion.h" VERSION_STRINGS REGEX "#define NV_TENSORRT_.*")
foreach(TYPE MAJOR MINOR PATCH BUILD)
- string(REGEX MATCH "NV_TENSORRT_${TYPE} [0-9]" TRT_TYPE_STRING ${VERSION_STRINGS})
- string(REGEX MATCH "[0-9]" TRT_${TYPE} ${TRT_TYPE_STRING})
-endforeach(TYPE)
-
-foreach(TYPE MAJOR MINOR PATCH)
- string(REGEX MATCH "NV_TENSORRT_SONAME_${TYPE} [0-9]" TRT_TYPE_STRING ${VERSION_STRINGS})
- string(REGEX MATCH "[0-9]" TRT_SO_${TYPE} ${TRT_TYPE_STRING})
+ string(REGEX MATCH "NV_TENSORRT_${TYPE} [0-9]+" TRT_TYPE_STRING ${VERSION_STRINGS})
+ string(REGEX MATCH "[0-9]+" TRT_${TYPE} ${TRT_TYPE_STRING})
endforeach(TYPE)
set(TRT_VERSION "${TRT_MAJOR}.${TRT_MINOR}.${TRT_PATCH}" CACHE STRING "TensorRT project version")
set(ONNX2TRT_VERSION "${TRT_MAJOR}.${TRT_MINOR}.${TRT_PATCH}" CACHE STRING "ONNX2TRT project version")
-set(TRT_SOVERSION "${TRT_SO_MAJOR}" CACHE STRING "TensorRT library so version")
+set(TRT_SOVERSION "${TRT_MAJOR}" CACHE STRING "TensorRT library so version")
message("Building for TensorRT version: ${TRT_VERSION}, library version: ${TRT_SOVERSION}")
if(NOT DEFINED CMAKE_TOOLCHAIN_FILE)
@@ -88,8 +108,8 @@ endif()
############################################################################################
# Dependencies
-set(DEFAULT_CUDA_VERSION 12.0.1)
-set(DEFAULT_CUDNN_VERSION 8.8)
+set(DEFAULT_CUDA_VERSION 12.2.0)
+set(DEFAULT_CUDNN_VERSION 8.9)
set(DEFAULT_PROTOBUF_VERSION 3.20.1)
# Dependency Version Resolution
@@ -118,20 +138,12 @@ endif()
include_directories(
${CUDA_INCLUDE_DIRS}
- ${CUDNN_ROOT_DIR}/include
)
-find_library(CUDNN_LIB cudnn HINTS
- ${CUDA_TOOLKIT_ROOT_DIR} ${CUDNN_ROOT_DIR} PATH_SUFFIXES lib64 lib/x64 lib)
-find_library(CUBLAS_LIB cublas HINTS
- ${CUDA_TOOLKIT_ROOT_DIR} PATH_SUFFIXES lib64 lib lib/x64 lib/stubs)
-find_library(CUBLASLT_LIB cublasLt HINTS
- ${CUDA_TOOLKIT_ROOT_DIR} PATH_SUFFIXES lib64 lib lib/x64 lib/stubs)
if(BUILD_PARSERS)
configure_protobuf(${PROTOBUF_VERSION})
endif()
find_library_create_target(nvinfer nvinfer SHARED ${TRT_LIB_DIR})
-find_library_create_target(nvuffparser nvparsers SHARED ${TRT_LIB_DIR})
find_library(CUDART_LIB cudart_static HINTS ${CUDA_TOOLKIT_ROOT_DIR} PATH_SUFFIXES lib lib/x64 lib64)
@@ -149,18 +161,11 @@ if (DEFINED GPU_ARCHS)
separate_arguments(GPU_ARCHS)
else()
list(APPEND GPU_ARCHS
- 53
- 60
- 61
70
75
)
string(REGEX MATCH "aarch64" IS_ARM "${TRT_PLATFORM_ID}")
- if (IS_ARM)
- # Xavier (SM72) only supported for aarch64.
- list(APPEND GPU_ARCHS 72)
- endif()
if (CUDA_VERSION VERSION_GREATER_EQUAL 11.0)
# Ampere GPU (SM80) support is only available in CUDA versions > 11.0
@@ -189,10 +194,10 @@ if (${LATEST_SM} GREATER_EQUAL 70)
endif()
if(NOT MSVC)
- set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler -Wno-deprecated-declarations")
+ set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr -Xcompiler -Wno-deprecated-declarations")
else()
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
- set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler")
+ set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr -Xcompiler")
endif()
############################################################################################
@@ -207,7 +212,6 @@ endif()
if(BUILD_PARSERS)
add_subdirectory(parsers)
else()
- find_library_create_target(nvcaffeparser nvparsers SHARED ${TRT_OUT_DIR} ${TRT_LIB_DIR})
find_library_create_target(nvonnxparser nvonnxparser SHARED ${TRT_OUT_DIR} ${TRT_LIB_DIR})
endif()
diff --git a/README.md b/README.md
index d31f2c4c..28a3edba 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Documentation](https://img.shields.io/badge/TensorRT-documentation-brightgreen.svg)](https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html)
# TensorRT Open Source Software
-This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. It includes the sources for TensorRT plugins and parsers (Caffe and ONNX), as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes.
+This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. It includes the sources for TensorRT plugins and ONNX parser, as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes.
* For code contributions to TensorRT-OSS, please see our [Contribution Guide](CONTRIBUTING.md) and [Coding Guidelines](CODING-GUIDELINES.md).
* For a summary of new additions and updates shipped with TensorRT-OSS releases, please refer to the [Changelog](CHANGELOG.md).
@@ -26,16 +26,17 @@ You can skip the **Build** section to enjoy TensorRT with Python.
To build the TensorRT-OSS components, you will first need the following software packages.
**TensorRT GA build**
-* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.6.1.6
+* TensorRT v10.0.0.6
+ * Available from direct download links listed below
**System Packages**
* [CUDA](https://developer.nvidia.com/cuda-toolkit)
* Recommended versions:
- * cuda-12.0.1 + cuDNN-8.8
- * cuda-11.8.0 + cuDNN-8.8
+ * cuda-12.2.0 + cuDNN-8.9
+ * cuda-11.8.0 + cuDNN-8.9
* [GNU make](https://ftp.gnu.org/gnu/make/) >= v4.1
* [cmake](https://github.com/Kitware/CMake/releases) >= v3.13
-* [python]() >= v3.6.9, <= v3.10.x
+* [python]() >= v3.8, <= v3.10.x
* [pip](https://pypi.org/project/pip/#history) >= v19.0
* Essential utilities
* [git](https://git-scm.com/downloads), [pkg-config](https://www.freedesktop.org/wiki/Software/pkg-config/), [wget](https://www.gnu.org/software/wget/faq.html#download)
@@ -44,9 +45,6 @@ To build the TensorRT-OSS components, you will first need the following software
* Containerized build
* [Docker](https://docs.docker.com/install/) >= 19.03
* [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker)
-* Toolchains and SDKs
- * (Cross compilation for Jetson platform) [NVIDIA JetPack](https://developer.nvidia.com/embedded/jetpack) >= 5.0 (current support only for TensorRT 8.4.0 and TensorRT 8.5.2)
- * (Cross compilation for QNX platform) [QNX Toolchain](https://blackberry.qnx.com/en)
* PyPI packages (for demo applications/tests)
* [onnx](https://pypi.org/project/onnx/)
* [onnxruntime](https://pypi.org/project/onnxruntime/)
@@ -74,24 +72,19 @@ To build the TensorRT-OSS components, you will first need the following software
If using the TensorRT OSS build container, TensorRT libraries are preinstalled under `/usr/lib/x86_64-linux-gnu` and you may skip this step.
- Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com/nvidia-tensorrt-download).
+ Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com) with the direct links below:
+ - [TensorRT 10.0.0.6 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz)
+ - [TensorRT 10.0.0.6 for CUDA 12.4, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz)
- **Example: Ubuntu 20.04 on x86-64 with cuda-12.0**
+
+ **Example: Ubuntu 20.04 on x86-64 with cuda-12.4**
```bash
cd ~/Downloads
- tar -xvzf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz
- export TRT_LIBPATH=`pwd`/TensorRT-8.6.1.6
+ tar -xvzf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz
+ export TRT_LIBPATH=`pwd`/TensorRT-10.0.0.6
```
-
-3. #### (Optional - for Jetson builds only) Download the JetPack SDK
- 1. Download and launch the JetPack SDK manager. Login with your NVIDIA developer account.
- 2. Select the platform and target OS (example: Jetson AGX Xavier, `Linux Jetpack 5.0`), and click Continue.
- 3. Under `Download & Install Options` change the download folder and select `Download now, Install later`. Agree to the license terms and click Continue.
- 4. Move the extracted files into the `/docker/jetpack_files` folder.
-
-
## Setting Up The Build Environment
For Linux platforms, we recommend that you generate a docker container for building TensorRT OSS as described below. For native builds, please install the [prerequisite](#prerequisites) *System Packages*.
@@ -99,27 +92,16 @@ For Linux platforms, we recommend that you generate a docker container for build
1. #### Generate the TensorRT-OSS build container.
The TensorRT-OSS build container can be generated using the supplied Dockerfiles and build scripts. The build containers are configured for building TensorRT OSS out-of-the-box.
- **Example: Ubuntu 20.04 on x86-64 with cuda-12.0 (default)**
- ```bash
- ./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda12.0
- ```
- **Example: CentOS/RedHat 7 on x86-64 with cuda-11.8**
- ```bash
- ./docker/build.sh --file docker/centos-7.Dockerfile --tag tensorrt-centos7-cuda11.8 --cuda 11.8.0
- ```
- **Example: Ubuntu 20.04 cross-compile for Jetson (aarch64) with cuda-11.4.2 (JetPack SDK)**
- ```bash
- ./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda11.4
- ```
- **Example: Ubuntu 20.04 on aarch64 with cuda-11.8**
+ **Example: Ubuntu 20.04 on x86-64 with cuda-12.3.2 (default)**
```bash
- ./docker/build.sh --file docker/ubuntu-20.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu20.04-cuda11.8 --cuda 11.8.0
+ ./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda12.3.2
```
+
2. #### Launch the TensorRT-OSS build container.
**Example: Ubuntu 20.04 build container**
```bash
- ./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda12.0 --gpus all
+ ./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda12.3.2 --gpus all
```
> NOTE:
1. Use the `--tag` corresponding to build container generated in Step 1.
@@ -130,7 +112,7 @@ For Linux platforms, we recommend that you generate a docker container for build
## Building TensorRT-OSS
* Generate Makefiles and build.
- **Example: Linux (x86-64) build with default cuda-12.0**
+ **Example: Linux (x86-64) build with default cuda-12.3.2**
```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
@@ -138,44 +120,8 @@ For Linux platforms, we recommend that you generate a docker container for build
make -j$(nproc)
```
- > NOTE: On CentOS7, the default g++ version does not support C++14. For native builds (not using the CentOS7 build container), first install devtoolset-8 to obtain the updated g++ toolchain as follows:
- ```bash
- yum -y install centos-release-scl
- yum-config-manager --enable rhel-server-rhscl-7-rpms
- yum -y install devtoolset-8
- export PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"
- ```
-
- **Example: Linux (aarch64) build with default cuda-12.0**
- ```bash
- cd $TRT_OSSPATH
- mkdir -p build && cd build
- cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
- make -j$(nproc)
- ```
-
- **Example: Native build on Jetson (aarch64) with cuda-11.4**
- ```bash
- cd $TRT_OSSPATH
- mkdir -p build && cd build
- cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DTRT_PLATFORM_ID=aarch64 -DCUDA_VERSION=11.4
- CC=/usr/bin/gcc make -j$(nproc)
- ```
- > NOTE: C compiler must be explicitly specified via `CC=` for native `aarch64` builds of protobuf.
-
- **Example: Ubuntu 20.04 Cross-Compile for Jetson (aarch64) with cuda-11.4 (JetPack)**
- ```bash
- cd $TRT_OSSPATH
- mkdir -p build && cd build
- cmake .. -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=11.4 -DCUDNN_LIB=/pdk_files/cudnn/usr/lib/aarch64-linux-gnu/libcudnn.so -DCUBLAS_LIB=/usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libcublas.so -DCUBLASLT_LIB=/usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libcublasLt.so -DTRT_LIB_DIR=/pdk_files/tensorrt/lib
-
- make -j$(nproc)
- ```
- > NOTE: The latest JetPack SDK v5.1 only supports TensorRT 8.5.2.
-
> NOTE:
- 1. The default CUDA version used by CMake is 12.0.1. To override this, for example to 11.8, append `-DCUDA_VERSION=11.8` to the cmake command.
- 2. If samples fail to link on CentOS7, create this symbolic link: `ln -s $TRT_OUT_DIR/libnvinfer_plugin.so $TRT_OUT_DIR/libnvinfer_plugin.so.8`
+ 1. The default CUDA version used by CMake is 12.2.0. To override this, for example to 11.8, append `-DCUDA_VERSION=11.8` to the cmake command.
* Required CMake build arguments are:
- `TRT_LIB_DIR`: Path to the TensorRT installation directory containing libraries.
- `TRT_OUT_DIR`: Output directory where generated build artifacts will be copied.
@@ -193,7 +139,7 @@ For Linux platforms, we recommend that you generate a docker container for build
- Tesla T4, GeForce RTX 2080: `-DGPU_ARCHS="75"`
- Titan V, Tesla V100: `-DGPU_ARCHS="70"`
- Multiple SMs: `-DGPU_ARCHS="80 75"`
- - `TRT_PLATFORM_ID`: Bare-metal build (unlike containerized cross-compilation) on non Linux/x86 platforms must explicitly specify the target platform. Currently supported options: `x86_64` (default), `aarch64`
+ - `TRT_PLATFORM_ID`: Bare-metal builds (as opposed to containerized cross-compilation) must explicitly specify the target platform. Currently supported options: `x86_64` (default).
# References
@@ -209,4 +155,4 @@ For Linux platforms, we recommend that you generate a docker container for build
## Known Issues
-* Please refer to [TensorRT 8.6 Release Notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#tensorrt-8)
+* Please refer to [TensorRT Release Notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes)
diff --git a/VERSION b/VERSION
index 811e1c1d..efdce495 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-8.6.1.6
+10.0.0.6
diff --git a/cmake/modules/find_library_create_target.cmake b/cmake/modules/find_library_create_target.cmake
index 1894ea51..a1d29efb 100644
--- a/cmake/modules/find_library_create_target.cmake
+++ b/cmake/modules/find_library_create_target.cmake
@@ -25,6 +25,9 @@ macro(find_library_create_target target_name lib libtype hints)
find_library(${lib}_LIB_PATH ${lib})
message(STATUS "Library that was found ${${lib}_LIB_PATH}")
add_library(${target_name} ${libtype} IMPORTED)
- set_property(TARGET ${target_name} PROPERTY IMPORTED_LOCATION ${${lib}_LIB_PATH})
+ set_property(TARGET ${target_name} PROPERTY IMPORTED_LOCATION ${${lib}_LIB_PATH}) # This should be a .so or .dll file; currently it's a .a or .lib.
+ if (WIN32)
+ set_property(TARGET ${target_name} PROPERTY IMPORTED_IMPLIB ${${lib}_LIB_PATH}) # This should be a .lib file
+ endif()
message(STATUS "==========================================================================================")
endmacro()
diff --git a/cmake/modules/set_ifndef.cmake b/cmake/modules/set_ifndef.cmake
index c64581c6..fbdc9be1 100644
--- a/cmake/modules/set_ifndef.cmake
+++ b/cmake/modules/set_ifndef.cmake
@@ -14,7 +14,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
-
function (set_ifndef variable value)
if(NOT DEFINED ${variable})
set(${variable} ${value} PARENT_SCOPE)
diff --git a/cmake/toolchains/cmake_aarch64.toolchain b/cmake/toolchains/cmake_aarch64.toolchain
index 3381c0c1..3c87fd65 100644
--- a/cmake/toolchains/cmake_aarch64.toolchain
+++ b/cmake/toolchains/cmake_aarch64.toolchain
@@ -46,7 +46,13 @@ set(BUILD_LIBRARY_ONLY 1)
set(CUDA_TOOLKIT_ROOT_DIR ${CUDA_ROOT})
set(CUDA_INCLUDE_DIRS ${CUDA_ROOT}/include)
-set(RT_LIB /usr/aarch64-linux-gnu/lib/librt.so)
+find_library(RT_LIB rt PATHS /usr/aarch64-linux-gnu/lib /usr/lib/aarch64-linux-gnu)
+
+if(NOT RT_LIB)
+ message(WARNING "librt.so not found in default paths")
+endif()
+
+message("RT_LIB: ${RT_LIB}")
# Use host nvcc
set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
@@ -56,4 +62,4 @@ set(CMAKE_CUDA_COMPILER_FORCED TRUE)
set(CUDA_LIBS -L${CUDA_ROOT}/lib)
-set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcublas -lcudart -lstdc++ -lm)
+set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lstdc++ -lm)
diff --git a/cmake/toolchains/cmake_aarch64_cross.toolchain b/cmake/toolchains/cmake_aarch64_cross.toolchain
new file mode 100644
index 00000000..177a82f9
--- /dev/null
+++ b/cmake/toolchains/cmake_aarch64_cross.toolchain
@@ -0,0 +1,55 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set(CMAKE_SYSTEM_NAME Linux)
+set(CMAKE_SYSTEM_PROCESSOR aarch64)
+
+set(TRT_PLATFORM_ID "aarch64")
+
+set(CUDA_PLATFORM_ID "sbsa-linux")
+
+set(CMAKE_C_COMPILER /usr/bin/aarch64-linux-gnu-gcc-8)
+set(CMAKE_CXX_COMPILER /usr/bin/aarch64-linux-gnu-g++-8)
+
+set(CMAKE_C_FLAGS "" CACHE STRING "" FORCE)
+set(CMAKE_CXX_FLAGS "" CACHE STRING "" FORCE)
+
+set(CMAKE_C_COMPILER_TARGET aarch64-linux-gnu)
+set(CMAKE_CXX_COMPILER_TARGET aarch64-linux-gnu)
+
+set(CMAKE_C_COMPILER_FORCED TRUE)
+set(CMAKE_CXX_COMPILER_FORCED TRUE)
+
+set(CUDA_ROOT /usr/local/cuda/targets/${CUDA_PLATFORM_ID} CACHE STRING "CUDA ROOT dir")
+
+set(CUDNN_LIB /usr/lib/aarch64-linux-gnu/libcudnn.so)
+
+set(BUILD_LIBRARY_ONLY 1)
+
+set(CUDA_TOOLKIT_ROOT_DIR ${CUDA_ROOT})
+set(CUDA_INCLUDE_DIRS ${CUDA_ROOT}/include)
+
+set(RT_LIB /usr/aarch64-linux-gnu/lib/librt.so)
+
+set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
+set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER} CACHE STRING "" FORCE)
+set(CMAKE_CUDA_FLAGS "-I${CUDA_INCLUDE_DIRS} -Xcompiler=\"-fPIC ${CMAKE_CXX_FLAGS}\"" CACHE STRING "" FORCE)
+set(CMAKE_CUDA_COMPILER_FORCED TRUE)
+
+set(CUDA_LIBS -L${CUDA_ROOT}/lib)
+
+set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcublas -lcudart -lstdc++ -lm)
diff --git a/demo/BERT/README.md b/demo/BERT/README.md
index 49d48436..f867a321 100755
--- a/demo/BERT/README.md
+++ b/demo/BERT/README.md
@@ -31,7 +31,6 @@ This subfolder of the BERT TensorFlow repository, tested and maintained by NVIDI
* [Results](#results)
* [Inference performance: NVIDIA A100](#inference-performance-nvidia-a100-40gb)
* [Inference performance: NVIDIA A30](#inference-performance-nvidia-a30)
- * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4-16gb)
## Model overview
@@ -124,6 +123,12 @@ This demo BERT application can be run within the TensorRT OSS build container. I
**Note:** Since the datasets and checkpoints are stored in the directory mounted from the host, they do *not* need to be downloaded each time the container is launched.
+**Warning:** If you encounter the error message "Missing API key and missing Email Authentication. This command requires an API key or authentication via browser login", resolve it as follows:
+* Generate an API key by logging in at https://ngc.nvidia.com/setup/api-key and copy the generated key.
+* Run `ngc config set` inside the docker container and paste the copied API key at the prompt.
+
+Completing these steps should resolve the error and allow the command to proceed.
+
4. Build a TensorRT engine. To build an engine, run the `builder.py` script. For example:
```bash
mkdir -p engines && python3 builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/model.ckpt -o engines/bert_large_128.engine -b 1 -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1
@@ -429,78 +434,78 @@ Results were obtained by running `scripts/inference_benchmark.sh --gpu Ampere` o
| Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 0.55 | 0.70 | 0.55 | 0.61 | 0.78 | 0.62 |
-| 128 | 2 | 0.78 | 0.78 | 0.62 | 0.72 | 0.92 | 0.73 |
-| 128 | 4 | 0.74 | 0.93 | 0.74 | 0.93 | 0.93 | 0.93 |
-| 128 | 8 | 0.95 | 0.95 | 0.94 | 1.31 | 1.31 | 1.31 |
-| 128 | 12 | 1.21 | 1.53 | 1.22 | 1.73 | 1.77 | 1.72 |
-| 128 | 16 | 1.34 | 1.34 | 1.34 | 2.09 | 2.10 | 2.07 |
-| 128 | 24 | 1.84 | 1.84 | 1.84 | 3.07 | 3.09 | 3.03 |
-| 128 | 32 | 2.27 | 2.27 | 2.26 | 3.93 | 3.94 | 3.90 |
-| 128 | 64 | 4.21 | 4.25 | 4.18 | 7.79 | 7.80 | 7.72 |
-| 128 | 128 | 8.25 | 8.26 | 8.14 | 15.41 | 15.42 | 15.27 |
-| 384 | 1 | 1.14 | 1.46 | 1.14 | 1.26 | 1.26 | 1.25 |
-| 384 | 2 | 1.31 | 1.31 | 1.31 | 1.55 | 1.55 | 1.55 |
-| 384 | 4 | 1.67 | 1.67 | 1.67 | 2.13 | 2.17 | 2.13 |
-| 384 | 8 | 2.22 | 2.22 | 2.22 | 3.36 | 3.39 | 3.35 |
-| 384 | 12 | 3.34 | 3.35 | 3.34 | 4.84 | 4.88 | 4.79 |
-| 384 | 16 | 4.04 | 4.04 | 4.04 | 6.40 | 6.46 | 6.39 |
-| 384 | 24 | 5.76 | 5.76 | 5.74 | 9.54 | 9.66 | 9.44 |
-| 384 | 32 | 7.71 | 7.71 | 7.70 | 13.02 | 13.03 | 12.90 |
-| 384 | 64 | 15.01 | 15.01 | 14.91 | 25.25 | 25.26 | 24.89 |
-| 384 | 128 | 29.26 | 29.26 | 29.13 | 49.12 | 49.25 | 48.81 |
+| 128 | 1 | 0.64 | 0.69 | 0.56 | 0.79 | 0.79 | 0.63 |
+| 128 | 2 | 0.78 | 0.78 | 0.62 | 0.80 | 0.80 | 0.73 |
+| 128 | 4 | 0.74 | 0.74 | 0.74 | 1.12 | 1.20 | 0.95 |
+| 128 | 8 | 1.22 | 1.23 | 0.96 | 1.31 | 1.31 | 1.31 |
+| 128 | 12 | 1.29 | 1.30 | 1.21 | 1.70 | 1.70 | 1.70 |
+| 128 | 16 | 1.34 | 1.34 | 1.34 | 2.10 | 2.10 | 2.08 |
+| 128 | 24 | 1.83 | 1.84 | 1.83 | 3.07 | 3.08 | 3.04 |
+| 128 | 32 | 2.25 | 2.26 | 2.25 | 3.95 | 3.95 | 3.92 |
+| 128 | 64 | 4.19 | 4.20 | 4.17 | 7.68 | 7.74 | 7.63 |
+| 128 | 128 | 8.15 | 8.16 | 8.10 | 15.45 | 15.46 | 15.30 |
+| 384 | 1 | 1.14 | 1.46 | 1.15 | 1.26 | 1.62 | 1.26 |
+| 384 | 2 | 1.32 | 1.32 | 1.32 | 1.55 | 1.55 | 1.55 |
+| 384 | 4 | 1.68 | 1.72 | 1.68 | 2.11 | 2.11 | 2.11 |
+| 384 | 8 | 2.22 | 2.23 | 2.22 | 3.38 | 3.42 | 3.35 |
+| 384 | 12 | 3.34 | 3.34 | 3.34 | 4.84 | 4.86 | 4.81 |
+| 384 | 16 | 4.02 | 4.03 | 4.02 | 6.41 | 6.41 | 6.39 |
+| 384 | 24 | 5.73 | 5.73 | 5.73 | 9.47 | 9.47 | 9.36 |
+| 384 | 32 | 7.75 | 7.77 | 7.68 | 13.05 | 13.12 | 12.92 |
+| 384 | 64 | 14.96 | 14.96 | 14.85 | 25.24 | 25.36 | 24.93 |
+| 384 | 128 | 29.13 | 29.14 | 28.89 | 49.27 | 49.37 | 48.84 |
##### BERT Large
| Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 1.24 | 1.25 | 1.24 | 1.58 | 1.60 | 1.58 |
-| 128 | 2 | 1.44 | 1.44 | 1.44 | 1.83 | 1.84 | 1.82 |
-| 128 | 4 | 1.78 | 1.79 | 1.78 | 2.54 | 2.54 | 2.53 |
-| 128 | 8 | 2.82 | 2.82 | 2.81 | 3.98 | 4.00 | 3.97 |
-| 128 | 12 | 3.11 | 3.11 | 3.11 | 5.08 | 5.12 | 5.04 |
-| 128 | 16 | 4.06 | 4.07 | 4.06 | 6.96 | 6.96 | 6.91 |
-| 128 | 24 | 5.31 | 5.32 | 5.31 | 9.69 | 9.70 | 9.63 |
-| 128 | 32 | 7.07 | 7.07 | 7.02 | 13.11 | 13.12 | 12.93 |
-| 128 | 64 | 12.97 | 13.08 | 12.89 | 24.94 | 25.22 | 24.74 |
-| 128 | 128 | 25.48 | 25.72 | 25.28 | 49.30 | 49.46 | 49.18 |
-| 384 | 1 | 2.59 | 2.59 | 2.59 | 2.98 | 2.99 | 2.98 |
-| 384 | 2 | 3.04 | 3.05 | 3.04 | 4.01 | 4.03 | 4.00 |
-| 384 | 4 | 4.03 | 4.04 | 4.03 | 5.79 | 5.79 | 5.73 |
-| 384 | 8 | 7.20 | 7.22 | 7.20 | 11.11 | 11.14 | 10.99 |
-| 384 | 12 | 9.19 | 9.20 | 9.19 | 15.47 | 15.63 | 15.39 |
-| 384 | 16 | 12.36 | 12.38 | 12.35 | 21.18 | 21.19 | 21.00 |
-| 384 | 24 | 17.77 | 17.95 | 17.68 | 31.41 | 31.42 | 30.90 |
-| 384 | 32 | 23.36 | 23.37 | 23.20 | 41.40 | 41.43 | 40.90 |
-| 384 | 64 | 45.60 | 45.61 | 45.26 | 80.07 | 80.25 | 79.50 |
-| 384 | 128 | 89.25 | 89.30 | 88.57 | 157.38 | 157.76 | 156.31 |
+| 128 | 1 | 1.24 | 1.24 | 1.23 | 1.56 | 1.56 | 1.56 |
+| 128 | 2 | 1.44 | 1.83 | 1.45 | 1.83 | 1.83 | 1.83 |
+| 128 | 4 | 1.78 | 1.78 | 1.78 | 2.55 | 2.56 | 2.55 |
+| 128 | 8 | 2.66 | 2.66 | 2.66 | 3.96 | 3.97 | 3.93 |
+| 128 | 12 | 3.11 | 3.11 | 3.10 | 5.07 | 5.12 | 5.05 |
+| 128 | 16 | 4.07 | 4.07 | 4.06 | 6.96 | 6.97 | 6.91 |
+| 128 | 24 | 5.31 | 5.32 | 5.31 | 9.72 | 9.82 | 9.63 |
+| 128 | 32 | 7.04 | 7.07 | 7.02 | 13.00 | 13.04 | 12.95 |
+| 128 | 64 | 12.96 | 12.96 | 12.86 | 24.90 | 25.07 | 24.71 |
+| 128 | 128 | 25.20 | 25.21 | 25.16 | 49.29 | 49.55 | 48.86 |
+| 384 | 1 | 2.57 | 2.57 | 2.57 | 2.98 | 2.98 | 2.98 |
+| 384 | 2 | 3.06 | 3.07 | 3.06 | 3.93 | 3.93 | 3.92 |
+| 384 | 4 | 4.03 | 4.03 | 4.03 | 5.78 | 5.79 | 5.74 |
+| 384 | 8 | 7.20 | 7.21 | 7.19 | 11.16 | 11.19 | 11.04 |
+| 384 | 12 | 9.18 | 9.18 | 9.17 | 15.51 | 15.51 | 15.39 |
+| 384 | 16 | 12.34 | 12.34 | 12.33 | 21.25 | 21.25 | 21.03 |
+| 384 | 24 | 17.74 | 17.79 | 17.69 | 31.13 | 31.14 | 30.82 |
+| 384 | 32 | 23.37 | 23.37 | 23.16 | 41.26 | 41.43 | 40.83 |
+| 384 | 64 | 45.08 | 45.09 | 45.01 | 79.88 | 80.21 | 79.18 |
+| 384 | 128 | 88.34 | 88.37 | 88.06 | 156.43 | 157.17 | 155.47 |
##### Megatron Large with Sparsity
| Sequence Length | Batch Size | INT8 QAT Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 1.29 | 1.54 | 1.29 |
-| 128 | 2 | 1.35 | 1.71 | 1.35 |
-| 128 | 4 | 1.79 | 2.14 | 1.79 |
+| 128 | 1 | 1.17 | 1.48 | 1.18 |
+| 128 | 2 | 1.49 | 1.88 | 1.50 |
+| 128 | 4 | 1.79 | 1.79 | 1.79 |
| 128 | 8 | 2.54 | 2.54 | 2.53 |
-| 128 | 12 | 2.93 | 2.93 | 2.92 |
-| 128 | 16 | 3.95 | 3.95 | 3.94 |
-| 128 | 24 | 4.93 | 4.94 | 4.92 |
-| 128 | 32 | 7.13 | 7.14 | 7.12 |
-| 128 | 64 | 11.64 | 11.64 | 11.62 |
-| 128 | 128 | 21.29 | 21.46 | 21.16 |
+| 128 | 12 | 2.95 | 2.95 | 2.94 |
+| 128 | 16 | 3.97 | 3.97 | 3.96 |
+| 128 | 24 | 4.91 | 4.91 | 4.90 |
+| 128 | 32 | 6.90 | 6.92 | 6.86 |
+| 128 | 64 | 11.61 | 11.64 | 11.59 |
+| 128 | 128 | 21.34 | 21.35 | 21.21 |
| 384 | 1 | 1.71 | 1.72 | 1.71 |
-| 384 | 2 | 2.24 | 2.25 | 2.23 |
-| 384 | 4 | 3.43 | 3.44 | 3.43 |
-| 384 | 8 | 5.77 | 5.77 | 5.76 |
-| 384 | 12 | 8.39 | 8.39 | 8.37 |
-| 384 | 16 | 10.38 | 10.39 | 10.36 |
-| 384 | 24 | 14.69 | 14.70 | 14.67 |
-| 384 | 32 | 18.68 | 18.82 | 18.66 |
-| 384 | 64 | 35.88 | 35.89 | 35.70 |
-| 384 | 128 | 68.71 | 68.73 | 68.16 |
+| 384 | 2 | 2.21 | 2.21 | 2.21 |
+| 384 | 4 | 3.47 | 3.47 | 3.47 |
+| 384 | 8 | 5.75 | 5.75 | 5.74 |
+| 384 | 12 | 8.37 | 8.38 | 8.35 |
+| 384 | 16 | 10.39 | 10.40 | 10.37 |
+| 384 | 24 | 14.61 | 14.62 | 14.59 |
+| 384 | 32 | 18.80 | 18.96 | 18.78 |
+| 384 | 64 | 35.90 | 35.92 | 35.62 |
+| 384 | 128 | 67.74 | 67.77 | 67.60 |
#### Inference performance: NVIDIA A30
@@ -511,76 +516,76 @@ Results were obtained by running `scripts/inference_benchmark.sh --gpu Ampere` o
| Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 0.91 | 0.92 | 0.62 | 1.18 | 1.18 | 0.82 |
-| 128 | 2 | 1.13 | 1.13 | 0.77 | 1.07 | 1.07 | 0.97 |
-| 128 | 4 | 1.04 | 1.57 | 1.05 | 1.46 | 2.11 | 1.44 |
-| 128 | 8 | 1.46 | 1.49 | 1.44 | 2.41 | 2.41 | 2.40 |
-| 128 | 12 | 1.94 | 1.94 | 1.94 | 3.42 | 3.45 | 3.40 |
-| 128 | 16 | 2.40 | 2.46 | 2.37 | 4.33 | 4.41 | 4.28 |
-| 128 | 24 | 3.54 | 3.59 | 3.48 | 6.59 | 6.60 | 6.50 |
-| 128 | 32 | 4.46 | 4.50 | 4.43 | 8.49 | 8.55 | 8.37 |
-| 128 | 64 | 8.68 | 8.75 | 8.57 | 16.65 | 16.67 | 16.47 |
-| 128 | 128 | 16.81 | 16.83 | 16.63 | 32.40 | 32.52 | 32.04 |
-| 384 | 1 | 1.31 | 1.32 | 1.31 | 1.62 | 1.64 | 1.63 |
-| 384 | 2 | 1.66 | 1.66 | 1.66 | 2.27 | 2.27 | 2.26 |
-| 384 | 4 | 2.32 | 2.32 | 2.30 | 3.79 | 3.87 | 3.72 |
-| 384 | 8 | 4.26 | 4.26 | 4.24 | 7.26 | 7.31 | 7.17 |
-| 384 | 12 | 6.10 | 6.13 | 6.04 | 10.35 | 10.43 | 10.23 |
-| 384 | 16 | 8.17 | 8.18 | 8.08 | 13.93 | 14.05 | 13.85 |
-| 384 | 24 | 11.91 | 11.98 | 11.82 | 20.46 | 20.57 | 20.25 |
-| 384 | 32 | 15.50 | 15.64 | 15.48 | 27.06 | 27.17 | 26.81 |
-| 384 | 64 | 31.03 | 31.18 | 30.63 | 52.44 | 52.48 | 52.05 |
-| 384 | 128 | 61.10 | 61.13 | 60.50 | 103.38 | 103.64 | 102.87 |
+| 128 | 1 | 0.88 | 0.88 | 0.61 | 0.78 | 1.14 | 0.79 |
+| 128 | 2 | 1.03 | 1.04 | 0.77 | 0.97 | 1.45 | 0.98 |
+| 128 | 4 | 1.04 | 1.56 | 1.05 | 1.43 | 1.44 | 1.41 |
+| 128 | 8 | 1.44 | 1.46 | 1.43 | 2.43 | 2.44 | 2.41 |
+| 128 | 12 | 1.92 | 1.92 | 1.91 | 3.44 | 3.45 | 3.39 |
+| 128 | 16 | 2.38 | 2.43 | 2.35 | 4.36 | 4.37 | 4.28 |
+| 128 | 24 | 3.47 | 3.50 | 3.44 | 6.56 | 6.65 | 6.48 |
+| 128 | 32 | 4.42 | 4.45 | 4.38 | 8.42 | 8.58 | 8.36 |
+| 128 | 64 | 8.58 | 8.66 | 8.49 | 16.58 | 16.60 | 16.40 |
+| 128 | 128 | 16.56 | 16.62 | 16.39 | 32.13 | 32.30 | 31.93 |
+| 384 | 1 | 1.31 | 2.01 | 1.32 | 1.63 | 1.63 | 1.62 |
+| 384 | 2 | 1.67 | 1.67 | 1.66 | 2.29 | 2.35 | 2.26 |
+| 384 | 4 | 2.29 | 2.34 | 2.27 | 3.74 | 3.77 | 3.71 |
+| 384 | 8 | 4.23 | 4.24 | 4.20 | 7.25 | 7.30 | 7.15 |
+| 384 | 12 | 6.05 | 6.10 | 6.00 | 10.21 | 10.27 | 10.12 |
+| 384 | 16 | 8.07 | 8.11 | 8.02 | 13.97 | 14.05 | 13.84 |
+| 384 | 24 | 11.85 | 11.86 | 11.71 | 20.31 | 20.42 | 20.16 |
+| 384 | 32 | 15.45 | 15.47 | 15.29 | 26.86 | 27.04 | 26.65 |
+| 384 | 64 | 30.49 | 30.74 | 30.25 | 52.21 | 52.34 | 51.75 |
+| 384 | 128 | 60.21 | 60.48 | 59.56 | 103.20 | 103.58 | 102.66 |
##### BERT Large
| Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 1.49 | 1.49 | 1.48 | 2.03 | 2.03 | 2.02 |
-| 128 | 2 | 1.83 | 1.84 | 1.82 | 2.79 | 2.79 | 2.76 |
-| 128 | 4 | 2.70 | 2.70 | 2.68 | 4.35 | 4.40 | 4.31 |
-| 128 | 8 | 4.50 | 4.52 | 4.47 | 8.07 | 8.17 | 8.01 |
-| 128 | 12 | 5.67 | 5.69 | 5.62 | 10.67 | 10.75 | 10.53 |
-| 128 | 16 | 8.08 | 8.13 | 7.95 | 14.86 | 14.86 | 14.72 |
-| 128 | 24 | 10.59 | 10.60 | 10.47 | 20.71 | 20.73 | 20.47 |
-| 128 | 32 | 14.16 | 14.21 | 14.03 | 28.21 | 28.37 | 27.98 |
-| 128 | 64 | 26.77 | 26.95 | 26.66 | 54.03 | 54.33 | 53.43 |
-| 128 | 128 | 52.65 | 52.78 | 52.12 | 106.15 | 106.75 | 105.37 |
-| 384 | 1 | 3.20 | 3.21 | 3.20 | 4.19 | 4.19 | 4.17 |
-| 384 | 2 | 4.26 | 4.26 | 4.22 | 6.61 | 6.63 | 6.56 |
-| 384 | 4 | 7.56 | 7.64 | 7.55 | 12.04 | 12.05 | 11.93 |
-| 384 | 8 | 13.01 | 13.07 | 12.84 | 22.81 | 22.89 | 22.56 |
-| 384 | 12 | 18.73 | 18.82 | 18.56 | 33.47 | 33.62 | 33.43 |
-| 384 | 16 | 24.41 | 24.51 | 24.16 | 44.45 | 44.47 | 44.03 |
-| 384 | 24 | 35.83 | 36.19 | 35.53 | 65.53 | 65.79 | 64.91 |
-| 384 | 32 | 47.34 | 47.52 | 46.86 | 85.92 | 86.16 | 85.15 |
-| 384 | 64 | 92.68 | 93.00 | 91.86 | 169.51 | 170.03 | 168.46 |
-| 384 | 128 | 181.91 | 182.29 | 181.02 | 334.01 | 334.51 | 332.81 |
+| 128 | 1 | 1.46 | 1.46 | 1.45 | 2.01 | 2.01 | 2.01 |
+| 128 | 2 | 1.83 | 1.85 | 1.83 | 2.80 | 2.83 | 2.75 |
+| 128 | 4 | 2.71 | 2.71 | 2.69 | 4.34 | 4.36 | 4.29 |
+| 128 | 8 | 4.33 | 4.35 | 4.28 | 8.12 | 8.20 | 8.03 |
+| 128 | 12 | 5.71 | 5.72 | 5.61 | 10.65 | 10.65 | 10.51 |
+| 128 | 16 | 7.62 | 7.64 | 7.55 | 14.57 | 14.66 | 14.55 |
+| 128 | 24 | 10.58 | 10.62 | 10.46 | 20.64 | 20.79 | 20.45 |
+| 128 | 32 | 14.18 | 14.26 | 13.99 | 28.17 | 28.31 | 28.01 |
+| 128 | 64 | 26.87 | 27.00 | 26.61 | 53.44 | 53.71 | 53.31 |
+| 128 | 128 | 52.36 | 52.71 | 51.90 | 105.42 | 105.95 | 104.96 |
+| 384 | 1 | 3.33 | 3.33 | 3.33 | 4.23 | 4.24 | 4.19 |
+| 384 | 2 | 4.26 | 4.26 | 4.23 | 6.63 | 6.65 | 6.57 |
+| 384 | 4 | 7.26 | 7.26 | 7.25 | 12.00 | 12.06 | 11.88 |
+| 384 | 8 | 12.91 | 12.99 | 12.83 | 22.61 | 22.69 | 22.45 |
+| 384 | 12 | 18.73 | 18.85 | 18.53 | 33.43 | 33.64 | 33.28 |
+| 384 | 16 | 24.06 | 24.22 | 24.02 | 44.35 | 44.64 | 44.06 |
+| 384 | 24 | 35.83 | 35.95 | 35.49 | 64.84 | 64.90 | 64.78 |
+| 384 | 32 | 47.05 | 47.27 | 46.73 | 85.89 | 86.17 | 85.11 |
+| 384 | 64 | 92.09 | 92.32 | 91.34 | 168.09 | 168.48 | 167.24 |
+| 384 | 128 | 180.47 | 180.90 | 179.75 | 330.71 | 331.31 | 329.53 |
##### Megatron Large with Sparsity
| Sequence Length | Batch Size | INT8 QAT Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 1.46 | 1.47 | 1.45 |
-| 128 | 2 | 1.88 | 1.88 | 1.87 |
-| 128 | 4 | 2.74 | 2.74 | 2.73 |
-| 128 | 8 | 4.11 | 4.12 | 4.10 |
-| 128 | 12 | 5.29 | 5.35 | 5.25 |
-| 128 | 16 | 7.52 | 7.57 | 7.50 |
-| 128 | 24 | 10.11 | 10.19 | 10.06 |
-| 128 | 32 | 12.85 | 12.90 | 12.80 |
-| 128 | 64 | 24.50 | 24.52 | 24.26 |
-| 128 | 128 | 46.24 | 46.57 | 45.92 |
-| 384 | 1 | 2.35 | 2.36 | 2.35 |
-| 384 | 2 | 3.90 | 3.91 | 3.89 |
-| 384 | 4 | 6.14 | 6.15 | 6.08 |
-| 384 | 8 | 11.74 | 11.76 | 11.64 |
-| 384 | 12 | 15.86 | 15.88 | 15.74 |
-| 384 | 16 | 21.21 | 21.27 | 21.05 |
-| 384 | 24 | 30.03 | 30.04 | 29.89 |
-| 384 | 32 | 40.20 | 40.22 | 40.05 |
-| 384 | 64 | 76.82 | 77.11 | 76.52 |
-| 384 | 128 | 149.54 | 149.80 | 148.78 |
+| 128 | 1 | 1.44 | 1.45 | 1.44 |
+| 128 | 2 | 1.84 | 1.84 | 1.84 |
+| 128 | 4 | 2.76 | 2.76 | 2.75 |
+| 128 | 8 | 4.12 | 4.12 | 4.11 |
+| 128 | 12 | 5.26 | 5.28 | 5.22 |
+| 128 | 16 | 7.52 | 7.52 | 7.51 |
+| 128 | 24 | 9.97 | 9.99 | 9.89 |
+| 128 | 32 | 12.84 | 12.85 | 12.80 |
+| 128 | 64 | 24.35 | 24.46 | 24.15 |
+| 128 | 128 | 46.38 | 46.60 | 45.96 |
+| 384 | 1 | 2.37 | 2.37 | 2.36 |
+| 384 | 2 | 3.88 | 3.88 | 3.87 |
+| 384 | 4 | 6.10 | 6.11 | 6.05 |
+| 384 | 8 | 11.60 | 11.63 | 11.49 |
+| 384 | 12 | 15.73 | 15.78 | 15.64 |
+| 384 | 16 | 20.95 | 21.01 | 20.90 |
+| 384 | 24 | 29.83 | 29.93 | 29.71 |
+| 384 | 32 | 40.01 | 40.09 | 39.75 |
+| 384 | 64 | 76.46 | 76.67 | 76.28 |
+| 384 | 128 | 148.96 | 149.23 | 148.11 |
diff --git a/demo/BERT/builder.py b/demo/BERT/builder.py
index c6d15d00..5eafe367 100755
--- a/demo/BERT/builder.py
+++ b/demo/BERT/builder.py
@@ -40,7 +40,7 @@
TensorRT Initialization
"""
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
-trt_version = [int(n) for n in trt.__version__.split('.')]
+trt_version = trt.__version__.split('.')
# Import necessary plugins for demoBERT
plugin_lib_name = "nvinfer_plugin.dll" if sys.platform == "win32" else "libnvinfer_plugin.so"
@@ -107,10 +107,7 @@ def attention_layer_opt(prefix, config, init_dict, network, input_tensor, imask)
Ball = init_dict[prefix + BQKV]
# FC_attention
- if config.use_int8:
- mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
- else:
- mult_all = network.add_fully_connected(input_tensor, 3 * hidden_size, Wall, Ball)
+ mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
if config.use_qat:
dr_qkv = max(
@@ -217,24 +214,20 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, imas
# FC0
B_aout = init_dict[prefix + B_AOUT]
- if config.use_int8:
+ if not config.use_int8 and use_custom_fc():
+ W_aoutT = init_dict[prefix + W_AOUT + "_notrans"]
+ attention_out_fc = custom_fc(config, network, attention_heads, hidden_size, W_aoutT)
+ else:
W_aout = init_dict[prefix + W_AOUT]
attention_out_fc = network.add_convolution_nd(attention_heads, hidden_size, (1, 1), W_aout, B_aout)
B_aout = None
- if not config.use_int8_skipln:
+ if config.use_int8 and not config.use_int8_skipln:
attention_out_fc.set_output_type(0, trt.DataType.HALF if config.use_fp16 else trt.DataType.FLOAT)
- if config.use_qat:
+ if config.use_int8 and config.use_qat:
dr_fc_aout = init_dict[prefix + 'attention_output_add_local_input_quantizer_amax']
set_output_range(attention_out_fc, dr_fc_aout)
- elif use_custom_fc():
- W_aoutT = init_dict[prefix + W_AOUT + "_notrans"]
- attention_out_fc = custom_fc(config, network, attention_heads, hidden_size, W_aoutT)
- else:
- W_aout = init_dict[prefix + W_AOUT]
- attention_out_fc = network.add_fully_connected(attention_heads, hidden_size, W_aout, B_aout)
- B_aout = None
skiplayer = skipln(prefix + "attention_output_layernorm_",config, init_dict, network, attention_out_fc.get_output(0), input_tensor, B_aout)
attention_ln = skiplayer.get_output(0)
@@ -245,10 +238,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, imas
# FC1 + GELU
B_mid = init_dict[prefix + B_MID]
W_mid = init_dict[prefix + W_MID]
- if config.use_int8:
- mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
- else:
- mid_dense = network.add_fully_connected(attention_ln, config.intermediate_size, W_mid, B_mid)
+ mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
mid_dense_out = mid_dense.get_output(0)
POW = network.add_constant((1, 1, 1, 1, 1), trt.Weights(np.ascontiguousarray([3.0], dtype=np.float32)))
@@ -281,21 +271,18 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, imas
# FC2
# Dense to hidden size
B_lout = init_dict[prefix + B_LOUT]
- if config.use_int8 and not config.use_fc2_gemm:
- W_lout = init_dict[prefix + W_LOUT]
- out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
- B_lout = None
-
- if not config.use_int8_skipln:
- out_dense.set_output_type(0, trt.DataType.HALF if config.use_fp16 else trt.DataType.FLOAT)
- elif use_custom_fc():
+ prefer_conv = config.use_int8 and not config.use_fc2_gemm
+ if not prefer_conv and use_custom_fc():
W_loutT = init_dict[prefix + W_LOUT + "_notrans"]
out_dense = custom_fc(config, network, intermediate_act, hidden_size, W_loutT)
else:
W_lout = init_dict[prefix + W_LOUT]
- out_dense = network.add_fully_connected(intermediate_act, hidden_size, W_lout, B_lout)
+ out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
B_lout = None
+ if config.use_int8 and not config.use_int8_skipln:
+ out_dense.set_output_type(0, trt.DataType.HALF if config.use_fp16 else trt.DataType.FLOAT)
+
if config.use_qat:
dr_fc_out = init_dict[prefix + 'output_add_local_input_quantizer_amax']
set_output_range(out_dense, dr_fc_out)
@@ -334,7 +321,7 @@ def squad_output(prefix, config, init_dict, network, input_tensor):
B_out = init_dict[prefix + SQD_B]
W = network.add_constant((1, hidden_size, 2), W_out)
- dense = network.add_fully_connected(input_tensor, 2, W_out, B_out)
+ dense = network.add_convolution_nd(input_tensor, 2, (1, 1), W_out, B_out)
OUT = network.add_shuffle(dense.get_output(0))
OUT.second_transpose = (1, 0, 2, 3, 4)
@@ -399,11 +386,16 @@ def emb_layernorm(builder, network, config, weights_dict, builder_config, sequen
return emb_layer
def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, verbose):
- explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
- with trt.Builder(TRT_LOGGER) as builder, builder.create_network(explicit_batch_flag) as network, builder.create_builder_config() as builder_config:
- builder_config.max_workspace_size = workspace_size * (1024 * 1024)
+ network_creation_flag = 0
+ if "EXPLICIT_BATCH" in trt.NetworkDefinitionCreationFlag.__members__.keys():
+ network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+
+ with trt.Builder(TRT_LOGGER) as builder, builder.create_network(network_creation_flag) as network, builder.create_builder_config() as builder_config:
+ builder_config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_size * (1024 * 1024))
builder_config.avg_timing_iterations = 8
+ # The cuBLAS tactic source can be removed once the QKV plugin no longer uses it.
+ builder_config.set_tactic_sources(builder_config.get_tactic_sources() | 1 << int(trt.TacticSource.CUBLAS))
if config.use_fp16:
builder_config.set_flag(trt.BuilderFlag.FP16)
if config.use_int8:
@@ -413,7 +405,9 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
builder_config.set_quantization_flag(trt.QuantizationFlag.CALIBRATE_BEFORE_FUSION)
builder_config.int8_calibrator = calibrator
if config.use_strict:
- builder_config.set_flag(trt.BuilderFlag.STRICT_TYPES)
+ builder_config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
+ builder_config.set_flag(trt.BuilderFlag.DIRECT_IO)
+ builder_config.set_flag(trt.BuilderFlag.REJECT_EMPTY_ALGORITHMS)
if verbose:
builder_config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
@@ -425,7 +419,7 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
# speed up the engine build for trt major version >= 8
# 1. disable cudnn tactic
# 2. load global timing cache
- if trt_version[0] >= 8:
+ if int(trt_version[0]) >= 8:
tactic_source = builder_config.get_tactic_sources() & ~(1 << int(trt.TacticSource.CUDNN))
builder_config.set_tactic_sources(tactic_source)
if config.timing_cache != None:
@@ -451,15 +445,16 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
squad_logits = squad_output("cls_", config, weights_dict, network, bert_out)
squad_logits_out = squad_logits.get_output(0)
+ squad_logits_out.name = "logits_out"
network.mark_output(squad_logits_out)
build_start_time = time.time()
- engine = builder.build_engine(network, builder_config)
+ serialized_engine = builder.build_serialized_network(network, builder_config)
build_time_elapsed = (time.time() - build_start_time)
TRT_LOGGER.log(TRT_LOGGER.INFO, "build engine in {:.3f} Sec".format(build_time_elapsed))
# save global timing cache
- if trt_version[0] >= 8 and config.timing_cache != None:
+ if int(trt_version[0]) >= 8 and config.timing_cache != None:
cache = builder_config.get_timing_cache()
with cache.serialize() as buffer:
with open(config.timing_cache, "wb") as f:
@@ -469,7 +464,7 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
if config.use_int8 and not config.use_qat:
calibrator.free()
- return engine
+ return serialized_engine
def generate_calibration_cache(sequence_lengths, workspace_size, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num):
"""
@@ -488,7 +483,7 @@ def generate_calibration_cache(sequence_lengths, workspace_size, config, weights
config.use_fp16 = False
config.is_calib_mode = True
- with build_engine([1], workspace_size, sequence_lengths, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, False) as engine:
+ with build_engine([1], workspace_size, sequence_lengths, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, False) as serialized_engine:
TRT_LOGGER.log(TRT_LOGGER.INFO, "calibration cache generated in {:}".format(calibrationCacheFile))
config.use_fp16 = saved_use_fp16
@@ -553,9 +548,7 @@ def main():
else:
raise RuntimeError("You need either specify TF checkpoint using option --ckpt or ONNX using option --onnx to build TRT BERT model.")
- with build_engine(args.batch_size, args.workspace_size, args.sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as engine:
- TRT_LOGGER.log(TRT_LOGGER.VERBOSE, "Serializing Engine...")
- serialized_engine = engine.serialize()
+ with build_engine(args.batch_size, args.workspace_size, args.sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as serialized_engine:
TRT_LOGGER.log(TRT_LOGGER.INFO, "Saving Engine to {:}".format(args.output))
with open(args.output, "wb") as fout:
fout.write(serialized_engine)
diff --git a/demo/BERT/builder_varseqlen.py b/demo/BERT/builder_varseqlen.py
index 0c1aeaac..ad25ef0c 100755
--- a/demo/BERT/builder_varseqlen.py
+++ b/demo/BERT/builder_varseqlen.py
@@ -39,7 +39,7 @@
TensorRT Initialization
"""
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
-trt_version = [int(n) for n in trt.__version__.split('.')]
+trt_version = trt.__version__.split('.')
# Import necessary plugins for demoBERT
plugin_lib_name = "nvinfer_plugin.dll" if sys.platform == "win32" else "libnvinfer_plugin.so"
@@ -107,10 +107,7 @@ def attention_layer_opt(prefix, config, init_dict, network, input_tensor, mask_i
Ball = init_dict[prefix + BQKV]
# FC_attention
- if config.use_int8:
- mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
- else:
- mult_all = network.add_fully_connected(input_tensor, 3 * hidden_size, Wall, Ball)
+ mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
if config.use_qat:
dr_qkv = max(
@@ -202,10 +199,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, resi
# FC0
B_aout = init_dict[prefix + B_AOUT]
W_aout = init_dict[prefix + W_AOUT]
- if config.use_int8:
- attention_out_fc = network.add_convolution_nd(attention_heads, hidden_size, (1, 1), W_aout, B_aout)
- else:
- attention_out_fc = network.add_fully_connected(attention_heads, hidden_size, W_aout, B_aout)
+ attention_out_fc = network.add_convolution_nd(attention_heads, hidden_size, (1, 1), W_aout, B_aout)
if config.use_int8 and config.use_qat:
dr_fc_aout = init_dict[prefix + 'attention_output_add_local_input_quantizer_amax']
set_output_range(attention_out_fc, dr_fc_aout)
@@ -225,10 +219,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, resi
# FC1 + GELU
B_mid = init_dict[prefix + B_MID]
W_mid = init_dict[prefix + W_MID]
- if config.use_int8:
- mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
- else:
- mid_dense = network.add_fully_connected(attention_ln, config.intermediate_size, W_mid, B_mid)
+ mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
gelu_layer = add_gelu(network, mid_dense.get_output(0))
@@ -247,10 +238,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, resi
B_lout = init_dict[prefix + B_LOUT]
W_lout = init_dict[prefix + W_LOUT]
- if config.use_int8:
- out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
- else:
- out_dense = network.add_fully_connected(intermediate_act, hidden_size, W_lout, B_lout)
+ out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
if config.use_int8 and config.use_qat:
dr_fc_out = init_dict[prefix + 'output_add_local_input_quantizer_amax']
set_output_range(out_dense, dr_fc_out)
@@ -327,6 +315,7 @@ def bert_model(config, init_dict, network, input_tensor, residual, mask_idx, cu_
squad_logits = squad_output("cls_", config, init_dict, network, prev_input)
squad_logits_out = squad_logits.get_output(0)
+ squad_logits_out.name = "logits_out"
network.mark_output(squad_logits_out)
@@ -339,11 +328,7 @@ def squad_output(prefix, config, init_dict, network, input_tensor):
W_out = init_dict[prefix + SQD_W]
B_out = init_dict[prefix + SQD_B]
- if config.use_int8:
- dense = network.add_convolution_nd(input_tensor, 2, (1, 1), W_out, B_out)
- else:
- dense = network.add_fully_connected(input_tensor, 2, W_out, B_out)
-
+ dense = network.add_convolution_nd(input_tensor, 2, (1, 1), W_out, B_out)
OUT = network.add_shuffle(dense.get_output(0))
if config.use_int8 and config.interleaved:
OUT.second_transpose = (1, 2, 0, 3)
@@ -394,10 +379,13 @@ def emb_layernorm(builder, network, config, weights_dict, builder_config, max_se
return emb_layer, cu_seqlens, max_seqlen
def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, verbose):
- explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
- with trt.Builder(TRT_LOGGER) as builder, builder.create_network(explicit_batch_flag) as network, builder.create_builder_config() as builder_config:
- builder_config.max_workspace_size = workspace_size * (1024 * 1024)
+ network_creation_flag = 0
+ if "EXPLICIT_BATCH" in trt.NetworkDefinitionCreationFlag.__members__.keys():
+ network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+
+ with trt.Builder(TRT_LOGGER) as builder, builder.create_network(network_creation_flag) as network, builder.create_builder_config() as builder_config:
+ builder_config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_size * (1024 * 1024))
builder_config.avg_timing_iterations = 8
if config.use_fp16:
builder_config.set_flag(trt.BuilderFlag.FP16)
@@ -412,7 +400,7 @@ def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_d
# speed up the engine build for trt major version >= 8
# 1. disable cudnn tactic
# 2. load global timing cache
- if trt_version[0] >= 8:
+ if int(trt_version[0]) >= 8:
tactic_source = builder_config.get_tactic_sources() & ~(1 << int(trt.TacticSource.CUDNN))
builder_config.set_tactic_sources(tactic_source)
if config.timing_cache != None:
@@ -454,12 +442,12 @@ def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_d
bert_model(config, weights_dict, network, embeddings, residual, mask_idx, cu_seqlens, max_seqlen)
build_start_time = time.time()
- engine = builder.build_engine(network, builder_config)
+ serialized_engine = builder.build_serialized_network(network, builder_config)
build_time_elapsed = (time.time() - build_start_time)
TRT_LOGGER.log(TRT_LOGGER.INFO, "build engine in {:.3f} Sec".format(build_time_elapsed))
# save global timing cache
- if trt_version[0] >= 8 and config.timing_cache != None:
+ if int(trt_version[0]) >= 8 and config.timing_cache != None:
cache = builder_config.get_timing_cache()
with cache.serialize() as buffer:
with open(config.timing_cache, "wb") as f:
@@ -467,7 +455,7 @@ def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_d
f.flush()
os.fsync(f)
- return engine
+ return serialized_engine
def main():
parser = argparse.ArgumentParser(description="TensorRT BERT Sample", formatter_class=argparse.ArgumentDefaultsHelpFormatter)
@@ -533,9 +521,7 @@ def main():
"PyTorch using option --pytorch, or Pickle weight dictionary using option --pickle "
"to build TRT BERT model.")
- with build_engine(args.max_batch_size, args.workspace_size, args.max_sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as engine:
- TRT_LOGGER.log(TRT_LOGGER.VERBOSE, "Serializing Engine...")
- serialized_engine = engine.serialize()
+ with build_engine(args.max_batch_size, args.workspace_size, args.max_sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as serialized_engine:
TRT_LOGGER.log(TRT_LOGGER.INFO, "Saving Engine to {:}".format(args.output))
with open(args.output, "wb") as fout:
fout.write(serialized_engine)
diff --git a/demo/BERT/infer_c/bert_infer.h b/demo/BERT/infer_c/bert_infer.h
index 827f9ba9..2f72102a 100644
--- a/demo/BERT/infer_c/bert_infer.h
+++ b/demo/BERT/infer_c/bert_infer.h
@@ -83,8 +83,7 @@ struct BertInference
}
gLogInfo << "Done\n";
- const int numBindingPerProfile = mEngine->getNbBindings() / mEngine->getNbOptimizationProfiles();
- mEnableVariableLen = numBindingPerProfile == kBERT_INPUT_NUM + 1 ? false : true;
+ mEnableVariableLen = mEngine->getNbIOTensors() == kBERT_INPUT_NUM + 1 ? false : true;
if (mEnableVariableLen)
{
gLogInfo << "Variable length is enabled\n";
@@ -153,15 +152,14 @@ struct BertInference
mDeviceBuffers.emplace_back(devBuf);
mHostOutput.resize(numOutputItems);
- mBindings.resize(mEngine->getNbBindings());
+ mBindings.resize(mEngine->getNbIOTensors() * mEngine->getNbOptimizationProfiles());
}
void prepare(int profIdx, int batchSize)
{
mContext->setOptimizationProfile(profIdx);
- const int numBindingPerProfile = mEngine->getNbBindings() / mEngine->getNbOptimizationProfiles();
- const int bindingIdxOffset = profIdx * numBindingPerProfile;
+ const int bindingIdxOffset = profIdx * mEngine->getNbIOTensors();
std::copy(mDeviceBuffers.begin(), mDeviceBuffers.end(), mBindings.begin() + bindingIdxOffset);
if (mEnableVariableLen)
@@ -169,14 +167,16 @@ struct BertInference
const int allocationSizes[] = {mSeqLength * batchSize, mSeqLength * batchSize, batchSize + 1, mSeqLength};
for (int i = 0; i < sizeof(allocationSizes)/sizeof(allocationSizes[0]); i++)
{
- mContext->setBindingDimensions(i + bindingIdxOffset, Dims{1, {allocationSizes[i]}});
+ auto const tensorName = mEngine->getIOTensorName(i % mEngine->getNbIOTensors());
+ mContext->setInputShape(tensorName, Dims{1, {allocationSizes[i]}});
}
}
else
{
for (int i = 0; i < kBERT_INPUT_NUM; i++)
{
- mContext->setBindingDimensions(i + bindingIdxOffset, Dims2(batchSize, mSeqLength));
+ auto const tensorName = mEngine->getIOTensorName(i);
+ mContext->setInputShape(tensorName, Dims2(batchSize, mSeqLength));
}
}
@@ -188,10 +188,16 @@ struct BertInference
if (mEnableGraph)
{
+ for (int32_t i = 0; i < mEngine->getNbIOTensors(); i++)
+ {
+ auto const& name = mEngine->getIOTensorName(i);
+ context->setTensorAddress(name, mBindings[i + bindingIdxOffset]);
+ }
+
cudaGraph_t graph;
cudaGraphExec_t exec;
// warm up and let mContext do cublas initialization
- bool status = mContext->enqueueV2(mBindings.data(), mStream, nullptr);
+ bool status = mContext->enqueueV3(mStream, nullptr);
if (!status)
{
gLogError << "Enqueue failed\n";
@@ -200,7 +206,7 @@ struct BertInference
gLogVerbose << "Capturing graph\n";
gpuErrChk(cudaStreamBeginCapture(mStream, cudaStreamCaptureModeRelaxed));
- status = mContext->enqueueV2(mBindings.data(), mStream, nullptr);
+ status = mContext->enqueueV3(mStream, nullptr);
if (!status)
{
gLogError << "Enqueue failed\n";
@@ -234,7 +240,7 @@ struct BertInference
}
else
{
- bool status = mContext->enqueueV2(mBindings.data(), mStream, nullptr);
+ bool status = mContext->enqueueV3(mStream, nullptr);
if (!status)
{
gLogError << "Enqueue failed\n";
@@ -259,7 +265,7 @@ struct BertInference
}
else
{
- bool status = mContext->enqueueV2(mBindings.data(), mStream, nullptr);
+ bool status = mContext->enqueueV3(mStream, nullptr);
if (!status)
{
gLogError << "Enqueue failed\n";
diff --git a/demo/BERT/inference.ipynb b/demo/BERT/inference.ipynb
index d015fd72..2882e0b6 100644
--- a/demo/BERT/inference.ipynb
+++ b/demo/BERT/inference.ipynb
@@ -19,7 +19,7 @@
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
- "# =============================================================================="
+ "# ==============================================================================\n"
]
},
{
@@ -99,7 +99,7 @@
"paragraph_text = \"The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975.\"\n",
"\n",
"# Short paragraph version for BERT models with max sequence length of 128\n",
- "short_paragraph_text = \"The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975.\""
+ "short_paragraph_text = \"The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975.\"\n"
]
},
{
@@ -118,7 +118,7 @@
"question_text = \"What project put the first Americans into space?\"\n",
"#question_text = \"What year did the first manned Apollo flight occur?\"\n",
"#question_text = \"What President is credited with the original notion of putting Americans in space?\"\n",
- "#question_text = \"Who did the U.S. collaborate with on an Earth orbit mission in 1975?\""
+ "#question_text = \"Who did the U.S. collaborate with on an Earth orbit mission in 1975?\"\n"
]
},
{
@@ -200,7 +200,7 @@
"outputs": [],
"source": [
"import tensorrt as trt\n",
- "TRT_LOGGER = trt.Logger(trt.Logger.INFO)"
+ "TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n"
]
},
{
@@ -212,7 +212,7 @@
"import ctypes\n",
"import os\n",
"\n",
- "ctypes.CDLL(\"libnvinfer_plugin.so\", mode=ctypes.RTLD_GLOBAL)"
+ "ctypes.CDLL(\"libnvinfer_plugin.so\", mode=ctypes.RTLD_GLOBAL)\n"
]
},
{
@@ -245,11 +245,12 @@
" # Specify input shapes. These must be within the min/max bounds of the active profile (0th profile in this case)\n",
" # Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.\n",
" for binding in range(3):\n",
- " context.set_binding_shape(binding, input_shape)\n",
+ " tensor_name = engine.get_tensor_name(binding)\n",
+ " context.set_input_shape(tensor_name, input_shape)\n",
" assert context.all_binding_shapes_specified\n",
"\n",
" # Allocate output buffer by querying the size from the context. This may be different for different input shapes.\n",
- " h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)\n",
+ " h_output = cuda.pagelocked_empty(tuple(context.get_tensor_shape(engine.get_tensor_name(3))), dtype=np.float32)\n",
" d_output = cuda.mem_alloc(h_output.nbytes)\n",
"\n",
" print(\"\\nRunning Inference...\")\n",
@@ -271,8 +272,14 @@
" cuda.memcpy_htod_async(d_inputs[1], segment_ids, stream)\n",
" cuda.memcpy_htod_async(d_inputs[2], input_mask, stream)\n",
"\n",
+ " # Setup tensor address\n",
+ " bindings = [int(d_inputs[i]) for i in range(3)] + [int(d_output)]\n",
+ "\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i])\n",
+ "\n",
" # Run inference\n",
- " context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" # Synchronize the stream\n",
" stream.synchronize()\n",
" eval_time_elapsed += (time.time() - eval_start_time)\n",
@@ -293,7 +300,7 @@
" \n",
" print(\"-----------------------------\")\n",
" print(\"Running Inference at {:.3f} Sentences/Sec\".format(1.0/eval_time_elapsed))\n",
- " print(\"-----------------------------\")"
+ " print(\"-----------------------------\")\n"
]
},
{
@@ -329,7 +336,7 @@
" for index, output in enumerate(networkOutputs):\n",
" print(\"Processing output\")\n",
" print(\"Answer: '{}'\".format(prediction))\n",
- " print(\"with prob: {:.3f}%\".format(nbest_json[0]['probability'] * 100.0))"
+ " print(\"with prob: {:.3f}%\".format(nbest_json[0]['probability'] * 100.0))\n"
]
}
],
diff --git a/demo/BERT/inference.py b/demo/BERT/inference.py
index 2116de8f..dc172181 100644
--- a/demo/BERT/inference.py
+++ b/demo/BERT/inference.py
@@ -134,34 +134,33 @@ def question_features(tokens, question):
# select engine profile
selected_profile = -1
- num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles
for idx in range(engine.num_optimization_profiles):
- profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)
+ profile_shape = engine.get_tensor_profile_shape(name = "input_ids", profile_index = idx)
if profile_shape[0][0] <= args.batch_size and profile_shape[2][0] >= args.batch_size and profile_shape[0][1] <= max_seq_length and profile_shape[2][1] >= max_seq_length:
selected_profile = idx
break
if selected_profile == -1:
raise RuntimeError("Could not find any profile that can run batch size {}.".format(args.batch_size))
- context.active_optimization_profile = selected_profile
- binding_idx_offset = selected_profile * num_binding_per_profile
+ # Create a stream in which to copy inputs/outputs and run inference.
+ stream = cuda.Stream()
+
+ context.set_optimization_profile_async(selected_profile, stream.handle)
+ binding_idx_offset = selected_profile * engine.num_io_tensors
# Specify input shapes. These must be within the min/max bounds of the active profile
# Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.
input_shape = (args.batch_size, max_seq_length)
input_nbytes = trt.volume(input_shape) * trt.int32.itemsize
- for binding in range(3):
- context.set_binding_shape(binding_idx_offset + binding, input_shape)
- assert context.all_binding_shapes_specified
-
- # Create a stream in which to copy inputs/outputs and run inference.
- stream = cuda.Stream()
+ for name in ["input_ids", "segment_ids", "input_mask"]:
+ context.set_input_shape(name, input_shape)
+ assert len(context.infer_shapes()) == 0
# Allocate device memory for inputs.
d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(3)]
# Allocate output buffer by querying the size from the context. This may be different for different input shapes.
- h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(binding_idx_offset + 3)), dtype=np.float32)
+ h_output = cuda.pagelocked_empty(tuple(context.get_tensor_shape("logits_out")), dtype=np.float32)
d_output = cuda.mem_alloc(h_output.nbytes)
def inference(features, tokens):
@@ -188,8 +187,14 @@ def inference(features, tokens):
cuda.memcpy_htod_async(d_inputs[1], segment_ids, stream)
cuda.memcpy_htod_async(d_inputs[2], input_mask, stream)
+ bindings = [0 for _ in range(binding_idx_offset)] + [int(d_inp) for d_inp in d_inputs] + [int(d_output)]
+
+ # Set the address of each IO tensor
+ for i in range(engine.num_io_tensors):
+ context.set_tensor_address(engine.get_tensor_name(i), bindings[i + binding_idx_offset])
+
# Run inference
- context.execute_async_v2(bindings=[0 for i in range(binding_idx_offset)] + [int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
# Synchronize the stream
stream.synchronize()
eval_time_elapsed += (time.time() - eval_start_time)
diff --git a/demo/BERT/inference_varseqlen.py b/demo/BERT/inference_varseqlen.py
index 9cd08519..7eb87012 100644
--- a/demo/BERT/inference_varseqlen.py
+++ b/demo/BERT/inference_varseqlen.py
@@ -130,15 +130,14 @@ def question_features(tokens, question):
# for each additional profile needed. Here, we only use batch size 1, thus we only need the first profile.
with open(args.engine, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime, \
runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:
+ # Create a stream in which to copy inputs/outputs and run inference.
+ stream = cuda.Stream()
# select engine profile
- context.active_optimization_profile = 0
+ context.set_optimization_profile_async(0, stream.handle)
input_nbytes = max_seq_length * trt.int32.itemsize
- # Create a stream in which to copy inputs/outputs and run inference.
- stream = cuda.Stream()
-
# Allocate device memory for inputs.
d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(4)]
@@ -164,14 +163,10 @@ def inference(features, tokens):
segment_ids = feature.segment_ids[0:S]
cu_seq_lens = np.array([0, S], dtype=np.int32);
- if context.get_binding_shape(0)[0] != S:
- context.set_binding_shape(0, (S,))
- if context.get_binding_shape(1)[0] != S:
- context.set_binding_shape(1, (S,))
- if context.get_binding_shape(2)[0] != 2:
- context.set_binding_shape(2, (2,))
- if context.get_binding_shape(3)[0] != S:
- context.set_binding_shape(3, (S,))
+ input_dim0_shape = {"input_ids": S, "segment_ids": S, "cu_seqlens": 2, "max_seqlen": S}
+ for name, val in input_dim0_shape.items():
+ if context.get_tensor_shape(name)[0] != val:
+ context.set_input_shape(name, (val,))
h_input_ids = cuda.register_host_memory(np.ascontiguousarray(input_ids.ravel()))
h_segment_ids = cuda.register_host_memory(np.ascontiguousarray(segment_ids.ravel()))
@@ -182,8 +177,14 @@ def inference(features, tokens):
cuda.memcpy_htod_async(d_inputs[1], h_segment_ids, stream)
cuda.memcpy_htod_async(d_inputs[2], h_cu_seq_lens, stream)
+ # Setup tensor address
+ bindings = [int(d_inputs[i]) for i in range(4)] + [int(d_output)]
+
+ for i in range(engine.num_io_tensors):
+ context.set_tensor_address(engine.get_tensor_name(i), bindings[i])
+
# Run inference
- context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
# Synchronize the stream
stream.synchronize()
eval_time_elapsed += (time.time() - eval_start_time)
diff --git a/demo/BERT/notebooks/Q-and-A.ipynb b/demo/BERT/notebooks/Q-and-A.ipynb
index c262a9cb..9c82199a 100755
--- a/demo/BERT/notebooks/Q-and-A.ipynb
+++ b/demo/BERT/notebooks/Q-and-A.ipynb
@@ -20,7 +20,7 @@
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
- "# =============================================================================="
+ "# ==============================================================================\n"
]
},
{
@@ -124,8 +124,14 @@
" cuda.memcpy_htod_async(d_inputs[1], segment_ids, stream)\n",
" cuda.memcpy_htod_async(d_inputs[2], input_mask, stream)\n",
"\n",
+ " # Setup tensor address\n",
+ " bindings = [int(d_inputs[i]) for i in range(3)] + [int(d_output)]\n",
+ "\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i])\n",
+ "\n",
" # Run inference\n",
- " trt_context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)\n",
+ " trt_context.execute_async_v3(stream_handle=stream.handle)\n",
" # Synchronize the stream\n",
" stream.synchronize()\n",
" eval_time_elapsed += (time.time() - eval_start_time)\n",
@@ -172,16 +178,21 @@
" S = np.sum(feature.input_mask)\n",
" input_ids = feature.input_ids[0:S]\n",
" segment_ids = feature.segment_ids[0:S]\n",
- " cu_seq_lens = np.array([0, S], dtype=np.int32);\n",
- "\n",
- " if context.get_binding_shape(0)[0] != S:\n",
- " context.set_binding_shape(0, (S,))\n",
- " if context.get_binding_shape(1)[0] != S:\n",
- " context.set_binding_shape(1, (S,))\n",
- " if context.get_binding_shape(2)[0] != 2:\n",
- " context.set_binding_shape(2, (2,))\n",
- " if context.get_binding_shape(3)[0] != S:\n",
- " context.set_binding_shape(3, (S,))\n",
+ " cu_seq_lens = np.array([0, S], dtype=np.int32)\n",
+ "\n",
+ " first_tensor_name = engine.get_tensor_name(0)\n",
+ " second_tensor_name = engine.get_tensor_name(1)\n",
+ " third_tensor_name = engine.get_tensor_name(2)\n",
+ " fourth_tensor_name = engine.get_tensor_name(3)\n",
+ "\n",
+ " if context.get_tensor_shape(first_tensor_name)[0] != S:\n",
+ " context.set_input_shape(first_tensor_name, (S,))\n",
+ " if context.get_tensor_shape(second_tensor_name)[0] != S:\n",
+ " context.set_input_shape(second_tensor_name, (S,))\n",
+ " if context.get_tensor_shape(third_tensor_name)[0] != 2:\n",
+ " context.set_input_shape(third_tensor_name, (2,))\n",
+ " if context.get_tensor_shape(fourth_tensor_name)[0] != S:\n",
+ " context.set_input_shape(fourth_tensor_name, (S,))\n",
"\n",
" h_input_ids = cuda.register_host_memory(np.ascontiguousarray(input_ids.ravel()))\n",
" h_segment_ids = cuda.register_host_memory(np.ascontiguousarray(segment_ids.ravel()))\n",
@@ -192,8 +203,14 @@
" cuda.memcpy_htod_async(d_inputs[1], h_segment_ids, INT8_stream)\n",
" cuda.memcpy_htod_async(d_inputs[2], h_cu_seq_lens, INT8_stream)\n",
"\n",
+ " # Setup tensor address\n",
+ " bindings = [int(d_inputs[i]) for i in range(3)] + [int(d_output)]\n",
+ "\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i])\n",
+ "\n",
" # Run inference\n",
- " trt_context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=INT8_stream.handle)\n",
+ " trt_context.execute_async_v3(stream_handle=INT8_stream.handle)\n",
" # Synchronize the stream\n",
" INT8_stream.synchronize()\n",
" eval_time_elapsed += (time.time() - eval_start_time)\n",
@@ -256,11 +273,12 @@
"# Specify input shapes. These must be within the min/max bounds of the active profile (0th profile in this case)\n",
"# Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.\n",
"for binding in range(3):\n",
- " context.set_binding_shape(binding, input_shape)\n",
+ " tensor_name = engine.get_tensor_name(binding)\n",
+ " context.set_input_shape(tensor_name, input_shape)\n",
"assert context.all_binding_shapes_specified\n",
"\n",
"# Allocate output buffer by querying the size from the context. This may be different for different input shapes.\n",
- "h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)\n",
+ "h_output = cuda.pagelocked_empty(tuple(context.get_tensor_shape(engine.get_tensor_name(3))), dtype=np.float32)\n",
"d_output = cuda.mem_alloc(h_output.nbytes)\n",
"\n",
"# Create a stream in which to copy inputs/outputs and run inference.\n",
@@ -275,7 +293,7 @@
"INT8_context = INT8_engine.create_execution_context()\n",
"\n",
"# select engine profile\n",
- "INT8_context.active_optimization_profile = 0\n",
+ "INT8_context.set_optimization_profile_async(0, stream.handle)\n",
"\n",
"input_nbytes = max_seq_length * trt.int32.itemsize\n",
"\n",
@@ -287,7 +305,7 @@
"INT8_d_output = cuda.mem_alloc(INT8_h_output.nbytes)\n",
"\n",
"# Create a stream in which to copy inputs/outputs and run inference.\n",
- "INT8_stream = cuda.Stream()"
+ "INT8_stream = cuda.Stream()\n"
]
},
{
@@ -412,7 +430,7 @@
" orientation='horizontal', \n",
" layout=widgets.Layout(width='100%', height='50px')\n",
")\n",
- "display(progress_bar)"
+ "display(progress_bar)\n"
]
},
{
diff --git a/demo/BERT/notebooks/benchmark.ipynb b/demo/BERT/notebooks/benchmark.ipynb
index 69666732..d09ec429 100755
--- a/demo/BERT/notebooks/benchmark.ipynb
+++ b/demo/BERT/notebooks/benchmark.ipynb
@@ -20,7 +20,7 @@
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
- "# =============================================================================="
+ "# ==============================================================================\n"
]
},
{
@@ -143,32 +143,35 @@
" cuda.memcpy_htod(buffers[2].buf, test_cu_seq_lens.ravel())\n",
"\n",
" bench_times = {}\n",
+ " stream = cuda.Stream()\n",
"\n",
" for idx, batch_size in enumerate(sorted(args.batch_size)):\n",
- " num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles\n",
" for idx in range(engine.num_optimization_profiles):\n",
- " profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)\n",
+ " profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * engine.num_io_tensors)\n",
" if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size:\n",
- " context.active_optimization_profile = idx\n",
- " binding_idx_offset = idx * num_binding_per_profile\n",
+ " context.set_optimization_profile_async(idx, stream.handle)\n",
+ " binding_idx_offset = idx * engine.num_io_tensors\n",
" break\n",
"\n",
" # Each profile has unique bindings\n",
" bindings = [0] * binding_idx_offset + [buf.binding() for buf in buffers]\n",
" input_shape = (batch_size, args.sequence_length)\n",
" for binding in range(3):\n",
- " context.set_binding_shape(binding_idx_offset + binding, input_shape)\n",
+ " tensor_name = engine.get_tensor_name(binding)\n",
+ " context.set_input_shape(tensor_name, input_shape)\n",
" assert context.all_binding_shapes_specified\n",
"\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i + binding_idx_offset])\n",
+ "\n",
" # Inference\n",
" total_time = 0\n",
" start = cuda.Event()\n",
" end = cuda.Event()\n",
- " stream = cuda.Stream()\n",
"\n",
" # Warmup\n",
" for _ in range(args.warm_up_runs):\n",
- " context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" stream.synchronize()\n",
"\n",
" # Timing loop\n",
@@ -176,7 +179,7 @@
" progress_bar.value = 0\n",
" for _ in range(iteration_selector.value):\n",
" start.record(stream)\n",
- " context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" end.record(stream)\n",
" stream.synchronize()\n",
" times.append(end.time_since(start))\n",
@@ -227,26 +230,28 @@
" cuda.memcpy_htod(buffers[1].buf, test_segment_ids.ravel())\n",
" cuda.memcpy_htod(buffers[2].buf, test_input_mask.ravel())\n",
"\n",
- " num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles\n",
- "\n",
" bench_times = {}\n",
+ " stream = cuda.Stream()\n",
"\n",
" for idx, batch_size in enumerate(sorted(args.batch_size)):\n",
- " num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles\n",
" for idx in range(engine.num_optimization_profiles):\n",
- " profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)\n",
+ " profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * engine.num_io_tensors)\n",
" if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size:\n",
- " context.active_optimization_profile = idx\n",
- " binding_idx_offset = idx * num_binding_per_profile\n",
+ " context.set_optimization_profile_async(idx, stream.handle)\n",
+ " binding_idx_offset = idx * engine.num_io_tensors\n",
" break\n",
"\n",
" # Each profile has unique bindings\n",
" bindings = [0] * binding_idx_offset + [buf.binding() for buf in buffers]\n",
" input_shape = (batch_size, args.sequence_length)\n",
" for binding in range(3):\n",
- " context.set_binding_shape(binding_idx_offset + binding, input_shape)\n",
+ " tensor_name = engine.get_tensor_name(binding)\n",
+ " context.set_input_shape(tensor_name, input_shape)\n",
" assert context.all_binding_shapes_specified\n",
"\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i + binding_idx_offset])\n",
+ "\n",
" # Inference\n",
" total_time = 0\n",
" start = cuda.Event()\n",
@@ -255,7 +260,7 @@
"\n",
" # Warmup\n",
" for _ in range(args.warm_up_runs):\n",
- " context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" stream.synchronize()\n",
"\n",
" # Timing loop\n",
@@ -263,7 +268,7 @@
" progress_bar.value = 0\n",
" for _ in range(iteration_selector.value):\n",
" start.record(stream)\n",
- " context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" end.record(stream)\n",
" stream.synchronize()\n",
" times.append(end.time_since(start))\n",
diff --git a/demo/BERT/perf.py b/demo/BERT/perf.py
index 5943b41b..7b4e9da9 100644
--- a/demo/BERT/perf.py
+++ b/demo/BERT/perf.py
@@ -77,8 +77,6 @@ def main():
cuda.memcpy_htod(buffers[1].buf, test_segment_ids.ravel())
cuda.memcpy_htod(buffers[2].buf, test_input_mask.ravel())
- num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles
-
bench_times = {}
stream = cuda.Stream()
@@ -86,7 +84,7 @@ def main():
# Select engine profile
selected_profile = -1
for idx in range(engine.num_optimization_profiles):
- profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)
+ profile_shape = engine.get_tensor_profile_shape(name = "input_ids", profile_index = idx)
if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size and profile_shape[0][1] <= args.sequence_length and profile_shape[2][1] >= args.sequence_length:
selected_profile = idx
break
@@ -95,18 +93,16 @@ def main():
context.set_optimization_profile_async(selected_profile, stream.handle)
# Each profile has unique bindings
- binding_idx_offset = selected_profile * num_binding_per_profile
+ binding_idx_offset = selected_profile * engine.num_io_tensors
bindings = [0] * binding_idx_offset + [buf.binding() for buf in buffers]
- shapes = {
- "input_ids": (batch_size, args.sequence_length),
- "segment_ids": (batch_size, args.sequence_length),
- "input_mask": (batch_size, args.sequence_length),
- }
+ input_shape = (batch_size, args.sequence_length)
+ for name in ["input_ids", "segment_ids", "input_mask"]:
+ context.set_input_shape(name, input_shape)
+ assert len(context.infer_shapes()) == 0
- for binding, shape in shapes.items():
- context.set_binding_shape(engine[binding] + binding_idx_offset, shape)
- assert context.all_binding_shapes_specified
+ for i in range(engine.num_io_tensors):
+ context.set_tensor_address(engine.get_tensor_name(i), bindings[i + binding_idx_offset])
# Inference
total_time = 0
@@ -115,7 +111,7 @@ def main():
# Warmup
for _ in range(args.warm_up_runs):
- context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
stream.synchronize()
# Timing loop
@@ -124,7 +120,7 @@ def main():
start_time = time.time()
while actual_iterations < args.iterations or (time.time() - start_time) < args.duration:
start.record(stream)
- context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
end.record(stream)
stream.synchronize()
times.append(end.time_since(start))
diff --git a/demo/BERT/perf_varseqlen.py b/demo/BERT/perf_varseqlen.py
index a1680797..853201a4 100644
--- a/demo/BERT/perf_varseqlen.py
+++ b/demo/BERT/perf_varseqlen.py
@@ -81,7 +81,8 @@ def main():
bench_times = {}
for idx, batch_size in enumerate(sorted(args.batch_size)):
- context.active_optimization_profile = 0
+ stream = cuda.Stream()
+ context.set_optimization_profile_async(0, stream.handle)
# Each profile has unique bindings
bindings = [buf.binding() for buf in buffers]
@@ -94,18 +95,20 @@ def main():
}
for binding, shape in shapes.items():
- context.set_binding_shape(engine[binding], shape)
- assert context.all_binding_shapes_specified
+ context.set_input_shape(binding, shape)
+ assert len(context.infer_shapes()) == 0
+
+ for i in range(engine.num_io_tensors):
+ context.set_tensor_address(engine.get_tensor_name(i), bindings[i])
# Inference
total_time = 0
start = cuda.Event()
end = cuda.Event()
- stream = cuda.Stream()
# Warmup
for _ in range(args.warm_up_runs):
- context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
stream.synchronize()
# Timing loop
@@ -114,7 +117,7 @@ def main():
start_time = time.time()
while actual_iterations < args.iterations or (time.time() - start_time) < args.duration:
start.record(stream)
- context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
end.record(stream)
stream.synchronize()
times.append(end.time_since(start))
diff --git a/demo/DeBERTa/deberta_tensorrt_inference.py b/demo/DeBERTa/deberta_tensorrt_inference.py
index 6a579a1c..378a5953 100644
--- a/demo/DeBERTa/deberta_tensorrt_inference.py
+++ b/demo/DeBERTa/deberta_tensorrt_inference.py
@@ -169,9 +169,10 @@ def allocate_buffers(self, engine):
bindings = []
stream = cuda.Stream()
- for binding in engine: # binding is the name of input/output
- size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
- dtype = trt.nptype(engine.get_binding_dtype(binding))
+ for i in range(engine.num_io_tensors):
+ tensor_name = engine.get_tensor_name(i)
+ size = trt.volume(engine.get_tensor_shape(tensor_name))
+ dtype = trt.nptype(engine.get_tensor_dtype(tensor_name))
# Allocate host and device buffers
host_mem = cuda.pagelocked_empty(size, dtype) # page-locked memory buffer (won't swapped to disk)
@@ -181,7 +182,7 @@ def allocate_buffers(self, engine):
bindings.append(int(device_mem))
# Append to the appropriate input/output list.
- if engine.binding_is_input(binding):
+ if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
inputs.append(self.HostDeviceMem(host_mem, device_mem))
else:
outputs.append(self.HostDeviceMem(host_mem, device_mem))
@@ -212,8 +213,8 @@ def __call__(self, model_inputs: list, timing=False):
batch_size = batch_size[0]
for i, model_input in enumerate(model_inputs):
- binding_name = self.engine[i] # i-th input/output name
- binding_dtype = trt.nptype(self.engine.get_binding_dtype(binding_name)) # trt can only tell to numpy dtype
+ binding_name = self.engine.get_tensor_name(i) # i-th input/output name
+ binding_dtype = trt.nptype(self.engine.get_tensor_dtype(binding_name)) # trt can only tell to numpy dtype
# input type cast
if NUMPY:
@@ -238,6 +239,9 @@ def __call__(self, model_inputs: list, timing=False):
# input, Host to Device
[cuda.memcpy_htod_async(inp.device, inp.host, self.stream) for inp in self.inputs]
+ for i in range(self.engine.num_io_tensors):
+ self.context.set_tensor_address(self.engine.get_tensor_name(i), self.bindings[i])
+
duration = 0
if timing:
start_time = time()
@@ -246,7 +250,7 @@ def __call__(self, model_inputs: list, timing=False):
duration = end_time - start_time
else:
# run inference
- self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle) # v2 no need for batch_size arg
+ self.context.execute_async_v3(stream_handle=self.stream.handle)
if timing:
[cuda.memcpy_dtoh(out.host, out.device) for out in self.outputs]
@@ -277,7 +281,10 @@ def build_engine():
print(f'Building {precision} engine of {MODEL_NAME} model on {gpu_name} GPU...')
## parse ONNX model
- network = TRT_BUILDER.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
+ network_creation_flag = 0
+ if "EXPLICIT_BATCH" in trt.NetworkDefinitionCreationFlag.__members__.keys():
+ network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+ network = TRT_BUILDER.create_network(network_creation_flag)
onnx_parser = trt.OnnxParser(network, TRT_LOGGER)
parse_success = onnx_parser.parse_from_file(ONNX_MODEL)
for idx in range(onnx_parser.num_errors):
@@ -296,11 +303,7 @@ def build_engine():
profile.set_shape("input_ids", (1,seq_len), (1,seq_len), (1,seq_len))
profile.set_shape("attention_mask", (1,seq_len), (1,seq_len), (1,seq_len))
config.add_optimization_profile(profile)
-
- if TRT_VERSION >= 84:
- config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4096 * (1 << 20)) # 4096 MiB, syntax after TRT 8.4
- else:
- config.max_workspace_size = 4096 * (1 << 20) # syntax before TRT 8.4
+ config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4096 * (1 << 20)) # 4096 MiB
# precision
if precision == 'fp32':
@@ -329,7 +332,7 @@ def test_engine():
## pseudo-random input test
batch_size = 1
- seq_len = model.engine.get_binding_shape(0)[1]
+ seq_len = model.engine.get_tensor_shape(model.engine.get_tensor_name(0))[1]
vocab = 128203
gpu = torch.device('cuda')
torch.manual_seed(0) # make sure in each test the seed are the same
@@ -362,7 +365,7 @@ def correctness_check_engines():
## pseudo-random input test
batch_size = 1
- seq_len = model1.engine.get_binding_shape(0)[1]
+ seq_len = model1.engine.get_tensor_shape(model1.engine.get_tensor_name(0))[1]
vocab = 128203
gpu = torch.device('cuda')
# torch.manual_seed(0) # make sure in each test the seed are the same
diff --git a/demo/Diffusion/README.md b/demo/Diffusion/README.md
index 4b9ca625..d550c83b 100644
--- a/demo/Diffusion/README.md
+++ b/demo/Diffusion/README.md
@@ -1,32 +1,34 @@
# Introduction
-This demo application ("demoDiffusion") showcases the acceleration of Stable Diffusion pipeline using TensorRT.
+This demo application ("demoDiffusion") showcases the acceleration of the Stable Diffusion and ControlNet pipelines using TensorRT.
# Setup
### Clone the TensorRT OSS repository
```bash
-git clone git@github.com:NVIDIA/TensorRT.git -b release/8.6 --single-branch
+git clone git@github.com:NVIDIA/TensorRT.git -b release/10.0 --single-branch
cd TensorRT
```
-### Launch TensorRT NGC container
+### Launch the NVIDIA PyTorch container
Install nvidia-docker using [these instructions](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
```bash
-docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:23.02-py3 /bin/bash
+docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:24.01-py3 /bin/bash
```
### Install latest TensorRT release
```bash
python3 -m pip install --upgrade pip
-python3 -m pip install --upgrade tensorrt
+python3 -m pip install --pre --upgrade --extra-index-url https://pypi.nvidia.com tensorrt
```
-Minimum required version is TensorRT 8.6.0. Check your installed version using:
+> NOTE: TensorRT 10.x is only available as a pre-release
+
+Check your installed version using:
`python3 -c 'import tensorrt;print(tensorrt.__version__)'`
> NOTE: Alternatively, you can download and install TensorRT packages from [NVIDIA TensorRT Developer Zone](https://developer.nvidia.com/tensorrt).
@@ -38,21 +40,21 @@ export TRT_OSSPATH=/workspace
cd $TRT_OSSPATH/demo/Diffusion
pip3 install -r requirements.txt
-# Create output directories
-mkdir -p onnx engine output
```
> NOTE: demoDiffusion has been tested on systems with NVIDIA A100, RTX3090, and RTX4090 GPUs, and the following software configuration.
```
-diffusers 0.14.0
-onnx 1.13.1
-onnx-graphsurgeon 0.3.26
-onnxruntime 1.14.1
-polygraphy 0.47.1
-tensorrt 8.6.1.6
-tokenizers 0.13.2
-torch 1.13.0
-transformers 4.26.1
+diffusers 0.26.3
+onnx 1.15.0
+onnx-graphsurgeon 0.3.27
+onnxruntime 1.17.0
+polygraphy 0.49.7
+tensorrt 10.0.0.6
+tokenizers 0.13.3
+torch 2.1.0
+transformers 4.31.0
+controlnet-aux 0.0.6
+nvidia-ammo 0.7.0
```
> NOTE: optionally install HuggingFace [accelerate](https://pypi.org/project/accelerate/) package for faster and less memory-intense model loading.
@@ -66,43 +68,104 @@ transformers 4.26.1
python3 demo_txt2img.py --help
python3 demo_img2img.py --help
python3 demo_inpaint.py --help
+python3 demo_controlnet.py --help
+python3 demo_txt2img_xl.py --help
```
### HuggingFace user access token
-To download the model checkpoints for the Stable Diffusion pipeline, you will need a `read` access token. See [instructions](https://huggingface.co/docs/hub/security-tokens).
+To download model checkpoints for the Stable Diffusion pipelines, obtain a `read` access token to HuggingFace Hub. See [instructions](https://huggingface.co/docs/hub/security-tokens).
```bash
export HF_TOKEN=
```
-### Generate an image guided by a single text prompt
+### Generate an image guided by a text prompt
+
+```bash
+python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN
+```
+
+### Generate an image guided by an initial image and a text prompt
+
+```bash
+wget https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg -O sketch-mountains-input.jpg
+
+python3 demo_img2img.py "A fantasy landscape, trending on artstation" --hf-token=$HF_TOKEN --input-image=sketch-mountains-input.jpg
+```
+
+### Generate an inpainted image guided by an image, mask and a text prompt
```bash
-python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN -v
+wget https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png -O dog-on-bench.png
+wget https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png -O dog-mask.png
+
+python3 demo_inpaint.py "a mecha robot sitting on a bench" --hf-token=$HF_TOKEN --input-image=dog-on-bench.png --mask-image=dog-mask.png
```
-### Generate an image guided by an image and single text prompt
+> NOTE: inpainting is only supported in versions `1.5` and `2.0`.
+
+### Generate an image with ControlNet guided by image(s) and text prompt(s)
+
+```bash
+python3 demo_controlnet.py "Stormtrooper's lecture in beautiful lecture hall" --controlnet-type depth --hf-token=$HF_TOKEN --denoising-steps 20 --onnx-dir=onnx-cnet-depth --engine-dir=engine-cnet-depth
+```
+
+> NOTE: `--input-image` must be a pre-processed image corresponding to `--controlnet-type`. If unspecified, a sample image will be downloaded. Supported controlnet types include: `canny`, `depth`, `hed`, `mlsd`, `normal`, `openpose`, `scribble`, and `seg`.
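+
+For instance, a hypothetical invocation that supplies a user-prepared depth map (`depth-map.png` is a placeholder for your own pre-processed image):
+
+```bash
+python3 demo_controlnet.py "Stormtrooper's lecture in beautiful lecture hall" --controlnet-type depth --input-image depth-map.png --hf-token=$HF_TOKEN --onnx-dir=onnx-cnet-depth --engine-dir=engine-cnet-depth
+```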
+
+Examples:
+
+
+#### Combining multiple conditionings
+
+Multiple ControlNet types can also be specified to combine the conditionings. When specifying multiple conditionings, ControlNet scales must also be provided; the scales signify the relative importance of each conditioning. For example, to condition using `openpose` and `canny` with scales of 1.0 and 0.8 respectively, pass `--controlnet-type openpose canny` and `--controlnet-scale 1.0 0.8`. Note that the number of ControlNet scales provided must match the number of ControlNet types, as in the command sketched below.
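+
+A sketch of such a combined invocation (the prompt and the output directory names here are illustrative placeholders):
+
+```bash
+python3 demo_controlnet.py "Stormtrooper's lecture in beautiful lecture hall" --controlnet-type openpose canny --controlnet-scale 1.0 0.8 --hf-token=$HF_TOKEN --onnx-dir=onnx-cnet-multi --engine-dir=engine-cnet-multi
+```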
+
+
+### Generate an image with Stable Diffusion XL guided by a single text prompt
+
+Run the command below to generate an image with Stable Diffusion XL:
```bash
-python3 demo_img2img.py "photorealistic new zealand hills" --hf-token=$HF_TOKEN -v
+python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN --version=xl-1.0
```
-Use `--input-image=` to specify your image. Otherwise the example image will be downloaded from the Internet.
+The optional refiner model may be enabled by specifying `--enable-refiner`, along with separate directories for storing the refiner ONNX and engine files using `--onnx-refiner-dir` and `--engine-refiner-dir` respectively.
-### Generate an inpainted image guided by an image, mask and single text prompt
+```bash
+python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN --version=xl-1.0 --enable-refiner --onnx-refiner-dir=onnx-refiner --engine-refiner-dir=engine-refiner
+```
+
+### Generate an image guided by a text prompt using specified LoRA model weight updates
```bash
-# Create separate onnx/engine directories when switching versions
-mkdir -p onnx-1.5 engine-1.5
+python3 demo_txt2img_xl.py "Picture of a rustic Italian village with Olive trees and mountains" --version=xl-1.0 --lora-path "ostris/crayon_style_lora_sdxl" "ostris/watercolor_style_lora_sdxl" --lora-scale 0.3 0.7 --onnx-dir onnx-sdxl-lora --engine-dir engine-sdxl-lora --build-enable-refit
+```
+
+### Faster Text-to-image using SDXL & INT8 quantization using AMMO
-python3 demo_inpaint.py "a mecha robot sitting on a bench" --hf-token=$HF_TOKEN --version=1.5 --onnx-dir=onnx-1.5 --engine-dir=engine-1.5 -v
+```bash
+python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8 --quantization-level 3
```
-Use `--input-image=` and `--mask-image=` to specify your inputs. They must have the same dimensions. Otherwise the example image and mask will be downloaded from the Internet.
+Note that the calibration process can be quite time-consuming, and will be repeated if `--quantization-level`, `--denoising-steps`, or `--onnx-dir` is changed.
-### Input arguments
-- One can set schdeuler using `--scheduler=EulerA`. Note that some schedulers are not available for some pipelines or version.
-- To accelerate engine building time one can use `--timing-cache=`. This cache file will be created if does not exist. Note, that it may influence the performance if the cache file created on the other hardware is used. It is suggested to use this flag only during development. To achieve the best perfromance during deployment, please, build engines without timing cache.
-- To switch between versions or pipelines one needs either to clear onnx and engine dirs, or to specify `--force-onnx-export --force-onnx-optimize --force-engine-build` or to create new dirs and to specify `--onnx-dir= --engine-dir=`.
+### Faster Text-to-Image using SDXL + LCM (Latent Consistency Model) LoRA weights
+[LCM-LoRA](https://arxiv.org/abs/2311.05556) produces good-quality images in 4 to 8 denoising steps instead of the 30+ needed by the base model. Note that we use the LCM scheduler and disable classifier-free guidance by setting `--guidance-scale` to 0.
+LoRA weights are fused into the ONNX model and the finalized TensorRT plan files in this example.
+```bash
+python3 demo_txt2img_xl.py "Einstein" --version xl-1.0 --lora-path "latent-consistency/lcm-lora-sdxl" --lora-scale 1.0 --onnx-dir onnx-sdxl-lcm-nocfg --engine-dir engine-sdxl-lcm-nocfg --denoising-steps 4 --scheduler LCM --guidance-scale 0.0
+```
+### Faster Text-to-Image using SDXL Turbo
+SDXL Turbo generates images even faster than LCM, producing coherent images in just one step. Note: SDXL Turbo works best at 512x512 resolution with the EulerA scheduler and classifier-free guidance disabled.
+```bash
+python3 demo_txt2img_xl.py "Einstein" --version xl-turbo --onnx-dir onnx-sdxl-turbo --engine-dir engine-sdxl-turbo --denoising-steps 1 --scheduler EulerA --guidance-scale 0.0 --width 512 --height 512
+```
+
+## Configuration options
+- The noise scheduler can be set using `--scheduler <scheduler>`. Note: not all schedulers are available for every version.
+- To reduce engine build time, use `--timing-cache <filename>`. The cache file will be created if it does not already exist. Note that performance may degrade if cache files are reused across different GPU targets, so it is recommended to use timing caches only during development. To achieve the best performance in deployment, build engines without a timing cache.
+- Specify new directories for storing ONNX and engine files when switching between versions, LoRAs, ControlNets, etc. This can be done using `--onnx-dir <dir>` and `--engine-dir <dir>` (a combined example follows this list).
- Inference performance can be improved by enabling [CUDA graphs](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs) using `--use-cuda-graph`. Enabling CUDA graphs requires fixed input shapes, so this flag must be combined with `--build-static-batch` and cannot be combined with `--build-dynamic-shape`.
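+
+As a sketch of how these options compose (the cache filename and directory names below are illustrative placeholders):
+
+```bash
+python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --scheduler EulerA --timing-cache timing.cache --onnx-dir onnx-sd --engine-dir engine-sd --build-static-batch --use-cuda-graph
+```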
+
+
+
diff --git a/demo/Diffusion/calibration-prompts.txt b/demo/Diffusion/calibration-prompts.txt
new file mode 100644
index 00000000..8b224e6d
--- /dev/null
+++ b/demo/Diffusion/calibration-prompts.txt
@@ -0,0 +1,1079 @@
+Portrait shot of a woman, yellow shirt, photograph
+Little girl holding a teddy bear, in the middle of nowhere, photograph
+Portrait of an arctic fox in the tundra, light teal and amber, minimalist, photograph
+Confused woman, sci - fi, future, blue glow color, orange, hologram, photograph
+Symmetrical, macro shot, crying womans face, half of face is organic flowing RGB low poly, depth of field
+Beautiful woman future funk psychedelic
+Mosaic of a colorful mushroom with intricate patterns, vibrant and detailed, sharp, mosaic background, vector art
+Illustration of a man in red hoodie, minimalist, graphic design poster art, dark cyan and sky - blue, honeycore
+a bottle of perfume on a clean backdrop, surrounded by fragrant white flowers, product photography, minimalistic, natural light
+a bedroom with large windows and modern furniture, gray and gold, luxurious, mid century modern style
+an aerial drone shot of the breathtaking landscape of the Bora Bora islands, with sparkling waters under the sun
+extreme closeup shot of an old man with a long gray hair and head covered in wrinkles; focused expression looking at camera
+Simple flat vector illustration of a woman sitting at the desk with her laptop with a puppy, isolated on white background
+Chibi pixel art, game asset for an rpg game on a white background featuring the armor of a dragon sorcerer wielding the power of fire surrounded by a matching item set
+a macro wildlife photo of a green frog in a rainforest pond, highly detailed, eye-level shot
+kid's coloring book, a happy young girl holding a flower, cartoon, thick lines, black and white, white background
+Golden-haired elementary school white boy hugging his black-hair Taiwanese buddy face-to-face on dusk street, unreal engine, greg rutkowski, loish, rhads, beeple, makoto shinkai and lois van baarle, ilya kuvshinov, rossdraws, tom bagshaw, alphonse mucha, global illumination, detailed and intricate environment
+Tan skin Anime boy wearing a large black sweater and cat ear beanie with brown hair and eyes, full body, baggy cargo pants, full body, reference
+Fawn French Bulldog with big eyes, short legs, and chunky, stocky body eating food
+A white goose holding a paint brush
+Black, African descent, looks Japanese, wears glasses, Naruto type art, bandage on his nose, male, Anime 2D art, lazy eyes, Japanese earring in one ear, no beard, smiles sinisterly
+Male cow fursona wearing a red beanie
+a beautiful hyper-realistic anime Lofi, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli, akihiko yoshida, anime, clean soft lighting, finely detailed features, high-resolution, perfect art, stunning atmosphere, trending on pixiv fanbox
+a woman with a beautiful face is enjoying a summer festival wearing a kimono, long white hair, looks like an older sister with a small body, is holding a traditional Japanese umbrella with a faint smile, her head is facing backwards as if inviting her to play and she is running with her arms behind her, there is also a lock of patterned hair flower
+A stunning photograph of a serene mountain lake at sunrise, with crystal-clear reflections and soft pastel skies
+A high-resolution image of an ancient oak tree in a lush forest, sunlight filtering through the leaves
+An ultra-realistic photograph of the Milky Way galaxy seen from a remote desert, under clear skies
+A detailed image of a colorful street market in Marrakech at golden hour, with vibrant fabrics and bustling crowds
+A professional photograph of a majestic bald eagle in flight, with a crisp focus on its sharp eyes and detailed feathers
+A perfect image of a charming cobblestone street in Prague, with historical buildings and a peaceful early morning atmosphere
+A photo-realistic image of a modern city skyline at night, with shimmering lights and reflections on a river
+An authentic-looking photograph of the Northern Lights over a snowy Lapland landscape, with vivid colors and clear stars
+A high-quality image of a vintage 1950s diner, with classic cars parked outside and a sunset backdrop
+An elegant photograph of a grand ballroom from the Victorian era, with ornate decorations and a grand chandelier
+A striking photograph of a powerful thunderstorm over the ocean, with dramatic lightning strikes and rolling waves
+An image of a peaceful Zen garden with smooth stones, raked sand, and a calming waterfall
+A high-resolution photograph of a seasoned fisherman at dawn, casting a net into the sea, with the golden light reflecting off the water
+A professional close-up shot of a woman's face, half-illuminated by the sunset, showcasing a detailed texture of her skin and a contemplative expression
+An image capturing a street dancer in mid-air during a dynamic breakdance move, with urban graffiti in the background
+A vibrant photograph of a group of people dressed in traditional attire at a cultural festival, dancing in a blur of colors and fabrics
+A cinematic-style photograph of a lone astronaut in a spacesuit, standing on a rocky alien landscape with Earth visible in the sky above
+Capture the quiet intensity in the eyes of a chess grandmaster poised over the board in a high-stakes match
+Close-up: A young girl's freckled face, focused and thoughtful, as she reads a book under the shade of an old tree
+Underwater photography of a diver among swirling schools of fish, light filtering down from above
+Evening falls on a city street musician, his guitar casting long shadows as he strums for the passing crowd
+High above the city, a construction worker perches on a steel beam, with a backdrop of the skyline stretching into the distance
+Document the intense expression of a potter as they shape a clay vessel, hands and wheel both a blur of motion
+A street portrait captures the weathered face of a long-time vendor, his cart a staple in the neighborhood for generations
+During golden hour, a group of children race through a field, their silhouettes a dance of joy against the setting sun
+Zoomed-in shot capturing the intense focus of a violinist as the bow gracefully sweeps across the strings, emotions etched into their performance
+Evening light bathes a street artist in a halo as they spray paint a vibrant mural, the colors telling a story as much as the subject's concentrated gaze
+A mid-action image of a chef's hands chopping herbs, with fine details showing flying droplets of water from the fresh greens
+On a misty morning, capture the solitary figure of a jogger on a deserted trail, their breath and stride in sync
+High in the mountains, a hiker reaches the summit, standing triumphantly with a panoramic view stretching behind them
+Illuminated by the soft glow of a desk lamp, a writer pauses, pen in hand, surrounded by stacks of manuscripts, lost in thought
+eerie, corruption, beautiful, young woman, sad eyes, tears running down, crying, innocence, light, vaporwave aesthetic, synthwave, colorful, psychedelic, crown, long gown, flowers, bees, butterflies, ribbons, ornate, intricate, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+wolf merged with crow,! photorealistic,! concept art
+To every living being, and every living soul. Now cometh the age of the stars. A thousand year voyage under the wisdom of the Moon. Here begins the chill night that encompasses all, reaching the great beyond. Into fear, doubt, and loneliness... As the path stretches into darkness. Mysterious shadow, detailed, digital, trending on artstation, hyper realistic, dark colours, 4k, dark aesthetic, in the style of James C. Christensen
+A cowboy cat with big and cute eyes, fine-face, realistic shaded perfect face, fine details. realistic shaded lighting poster by Ilya Kuvshinov katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash, Rob Rey and Kentarõ Miura style, trending on art station
+a very beautiful anime cute girl, full body, long wavy blond hair, sky blue eyes, full round face, short smile, fancy top, miniskirt, front view, summer lake setting, cinematic lightning, medium shot, mid-shot, highly detailed, trending on Artstation, Unreal Engine 4k, cinematic wallpaper by Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti
+close-up portrait of the perfect and symmetrical face of a beautiful Cotton Mill Girl, symmetrical, centered, dramatic angle, ornate, details, smooth, sharp focus, illustration, realistic, cinematic, artstation, award winning, rgb , unreal engine, octane render, cinematic light, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art CG render made in Maya, Blender and Photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse by Henri Cartier Bresson
+highly detailed portrait of beautiful ethereal woman in ornate clothing, stephen bliss, unreal engine, fantasy art by greg rutkowski, loish, rhads, ferdinand knab, makoto shinkai and lois van baarle, ilya kuvshinov, rossdraws, tom bagshaw, global illumination, radiant light, detailed and intricate environment
+Close-up portrait of young asian girl, long blonde hair, dark fantasy, portrait, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+necromancer glowing with purple magic, red hair, female, glacier landscape, D&D, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+a portrait of riddler, fantasy, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, alphonse mucha, and greg rutkowski
+duotone dark scifi illustration 3 / 4 portrait of dream as if you live forever live as if you die tomorrow. cinematic lighting mad scientist style. golden ratio accidental renaissance. in the style of jean michel basquiat, beksisnski, and pablo picasso. graffiti art, scifi, fantasy, hyper detailed. octane render. concept art. trending on artstation
+elon musk as neo from the matrix, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+patrick star with a sad!!! expression slouching on a bench in the bikini bottom, global illumination!!! dim lighting, midnight, cinematic, extremely detailed, beautiful, stunning composition, beautiful light rays, trending on artstation
+a girl in a hat with a bouquet of peonies looks out the window at a blooming garden, vivid color, highly detailed, cyberpunk, digital painting, artstation, concept art, matte, sharp focus, art by vrubel
+a detailed concept art of a fantasy jingle bell infused with magic, trending on artstation, digital art, 4 k, intricate, octane render, sharp focus
+“ dungeons and dragons tabaxi rogue, anthromorphic cat person with a repeating crossbow in a medieval city, small and big, illustration, fantasy, trending on artstation ”
+fantasy, book cover, concept art, by greg rutkowski and craig mullins, cozy atmospheric
+Amelie Poulain painted by Raphael volumetric lighting, back lighting, rimlight, dramatic lighting, digital painting, highly detailed, artstation, sharp focus, illustration, Artgerm, Jean-Léon Gérôme, ruan jia
+soft bokeh front shot photo of a mclaren steampunk concept car, cinematic, fine details, symmetrical, 4 k, digital art, wallpaper
+dior runway show, light, shadows, reflections, golden, gold, epic composition, intricate, elegant, volumetric lighting, digital painting, highly detailed, artstation, sharp focus, illustration, concept art, ruan jia, steve mccurry
+elven princess assassin, beautiful shadowing, 3 d shadowing, reflective surfaces, illustrated completely, 8 k beautifully detailed pencil illustration, extremely hyper - detailed pencil illustration, intricate, epic composition, very very kawaii, masterpiece, bold complimentary colors. stunning masterfully illustrated by artgerm and range murata.
+gorgeous red fox in a suit drinking champagne, digital art, landscape, fantasy art, octane render, ureal engine, high detail, very realistic, by greg rutkowski. by james gurney
+an extremely psychedelic portrait of medusa as willy wonka, surreal, lsd, face, detailed, intricate, elegant, lithe, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration
+portrait painting of a muscular bloodied mixed girl, ultra realistic, cyberpunk hacknaut, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+concept art for the main character in the award winning film named life is better in pink. the character is a unnaturally beautiful teenage girl with deep dark blue eyes and long curled pink hair, wearing light pink clothes. realistic cg render, anatomically correct, high key lighting, trending on art station, vibrant colors. cute and highly detailed eyes.
+beautiful woman, illustration, painting oil on canvas, intricate portrait, detailed, illustration, hd, digital art, overdetailed, art, concept, art
+detailed full body concept art illustration oil painting of an anthropomorphic capybara cook in full intricate clothing, biomutant, ultra detailed, digital art, octane render
+of a calm ocean with large strange cute happy flying creatures with huge eyes, mouth, long tongue and round teeth appearing from the sky, in the style of gehry and gaudi, macro lens, highly detailed, shallow depth of fielf, digital painting, trending artstation, concept art, illustration, cinematic lighting, vibrant colors, photorealism, epic, octane render
+symmetry, samurai, lines, brown skin, machine face, intricate, elegant, highly detailed, digital painting, artstation, cgsociety, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+cyberpunk Normani as aeon flux profile picture by Greg Rutkowski, dynamic pose, intricate, futuristic, fantasy, elegant, by Stanley Artgerm Lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell, metal chrome, shiny, rainy background, asymmetric, afro hair,
+chris tucker as dhalsim street fighter, jump kick, 4 k, ultra realistic, detailed focused art by artgerm and greg rutkowski and alphonse mucha
+epic scene where mystical dead monk sitting in front of an epic portal, epic angle and pose, symmetrical artwork, 3d with depth of field, blurred background, cybernetic orchid flower butterfly jellyfish crystal dragon, female face skull phoenix bird, translucent, nautilus, energy flow. a highly detailed epic cinematic concept art CG render. made in Maya, Blender and Photoshop, octane render, excellent composition, cinematic dystopian brutalist atmosphere, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse. y Greg Rutkowski, Ilya Kuvshinov, WLOP, Stanley Artgerm Lau, Ruan Jia and Fenghua Zhong
+concept art of futuristic modular military base, top angle, oil painting by jama jurabaev, extremely detailed, brush hard, artstation, for aaa game, high quality, brush stroke
+portrait of natalie wood eating hamburgers, extra onions and ketchup, luscious patty with sesame seeds, feminine ethereal, handsome, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Trending on Artstation, Dark and rainy mega city with towering walls built to block the migrants of the coming climate change migrant crisis showing piles of hundred bodies outside to maintain a quality of life for those who can survive the severe and deadly weather patterns observing small children targeted by advanced military style drones, dystopian, concept art illustration, tilt shift background, wide depth of field, 8k, 35mm film grain
+hard surface form fused with organic form fashion outfit design, rainbow iridescent accents, full body frontal view, Peter mohrbacher, zaha hadid, tsutomu nihei, emil melmoth, zdzislaw belsinki, Craig Mullins, yoji shinkawa, trending on artstation, beautifully lit, hyper detailed, insane details, intricate, elite, ornate, elegant, luxury, dramatic lighting, CGsociety, hypermaximalist, golden ratio, octane render, weta digital, micro details, ray trace, 8k,
+Gary Busey portrait by Stanley Artgerm Lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell
+( cyberpunk 2 0 7 7, bladerunner 2 0 4 9 ), a complex thick bifurcated robotic cnc surgical arm cybernetic symbiosis hybrid mri 3 d printer machine making a bio chemical lab, art by artgerm and greg rutkowski and alphonse mucha, biomechanical, lens orbs, global illumination, lounge, architectural, f 3 2,
+a vampire, male, mid - 3 0 s aged, long black hair, clean shaven, in red and black, high fantasy, realistic, highly detailed, concept art, 8 k.
+a elderly wizard casting a black fireball | | pencil sketch, realistic shaded, fine details, realistic shaded lighting poster by greg rutkowski, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+An elegant green, blue dragon, sitting on a clearing in a flowery jungle, detailed, mtg, digital illustration, trending on artstation
+a landscape made of whimsical energy and fibrous magic, artstation landscape, artstation digital, illustrated by eddie mendoza and greg rutkowski, trending on artstation, cgsociety contest winner, cgsociety hd, cgsociety 4 k uhd, 4 k, 8 k
+a cosmic painting of prince in space. mindblowing colours, trending on artstation. highly detailed face.
+martian chronicles, by jean delville and sophie anderson and mandy jurgens, retrofuturism, moody atmosphere, cinematic atmospheric, cinematic lighting, golden ratio, perfect composition, elegant, no crop, extremely detailed, 4 k, hd, sharp focus, masterpiece, trending on artstation
+a highly detailed metahuman 4 k close up render of a seraphim bella hadid monument renaissance in iris van herpen dress schiaparelli in diamonds crystals swarovski and jewelry iridescent in style of alphonse mucha gustav klimt trending on artstation made in unreal engine 4
+fever of the night, a grime tale of the night fever, disco club of the occult, digital painting, artstation, ristan eaton, victo ngai, artgerm, rhads, ross draws, anime styled
+symmetrical, full body portrait of a woman with short wavy hair, round face, cottagecore!!, lake, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+fantasy man sitting in library, gold brocaded dark blue clothes, short black hair, books, reddish brown engraved shelves, sharp focus, intricate, extremely detailed, cinematic lighting, smooth, ultra realistic illustration, high fantasy, elegant, artgerm, greg rutkowski, alphonse mucha magali villeneuve
+an anthropomorphic deer, fursona!!! by don bluth, by kawacy, trending on artstation, full body
+a cartoon squirrel drawn in concept art style
+russian poet alexander pushkin and shrek having breakfast together, portrait, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+beautiful woman on a turquise vespa moped, in the style of artgerm, gerald brom, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation, masterpiece
+a martian landscape, by ralph mac quarrie and francois schuiten and albert bierstadt and ernst haeckel and james jean and john singer sargent, cinematic lighting, moody atmosphere, golden ratio, perfect composition, elegant and stylish look, artstation, concept art, high quality
+“ anime, full body, a pretty girl taking the college entrance exam, highly intricate detailed, light and shadow effects, intricate, highly detailed, digital painting, art station, concept art, smooth, sharp focus, illustration, advanced digital anime art, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau, craig mullins, j. c. leyendecker, atmospheric lighting, detailed face, by makoto shinkai, stanley artgerm lau, wlop, rossdraws ”
+the second coming of the buddah, by dan mumford and ross tran, cosmic, heavenly, god rays, intricate detail, cinematic, 8 k, cel shaded, unreal engine, featured on artstation, pixiv
+phil noto, peter mohrbacher, thomas kinkade, artgerm, 1 9 5 0 s rockabilly anya taylor - joy catwoman dc comics, symmetrical eyes, city rooftop
+dnd character concept portrait, angry male elf druid in forest, detailed, high quality, dynamic lighting, fantasy, artwork by artgerm, wlop, alex ross, greg rutknowski, alphonse mucha
+a king with a skull head, in the style of artgerm, charlie bowater, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation
+With the spikes in her hair
+venus, the empress, wearing a magnificent dress, sitting on a divan in the middle of a beautiful green plains full of little flowers. intricate, elegant, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, by justin gerard and artgerm, 8 k
+beautiful apocalyptic woman with pink Mohawk, standing on mad max panzer tank, 4k ultra hd, fantasy dark art, tank girl, artgerm, concept art, artstation, octane render, elegant, detailed digital painting
+i crave only the cold clean certainty of steel and silicon, trending on artstation
+nikola tesla, lightning, portrait, sharp focus, digital art, concept art, dynamic lighting, epic composition, colorful, trending on artstation, by emylie boivin 2. 0, rossdraws 2. 0
+professional concept art of a symmetrical ominous floating terrifying thing in a dark room by artgerm and greg rutkowski ( thin white border ). an intricate, elegant, highly detailed digital painting, concept art, smooth, sharp focus, illustration, in the style of cam sykes, wayne barlowe, igor kieryluk.
+beautiful lifelike award winning marble statue bust of tsunku trending on art station artgerm greg rutkowski alphonse mucha museum quality cinematic atmospheric
+steampunk robot ant, unreal engine realistic render, 8 k, micro detail, intricate, elegant, highly detailed, centered, digital painting, artstation, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+mtg character portrait of a brawny male leonin warrior african lion angel of justice, with fiery golden wings of flame, wearing shining armor, wielding flaming sword and holding large fiery shield, by peter mohrbacher, wadim kashin, greg rutkowski, larry elmore, george pemba, ernie barnes, raymond swanland, magali villeneuve, trending on artstation
+dynamic portrait painting of Michael Myers sitting in the waiting room of an optometrist amongst other normal patients, sharp focus, face focused, trending on ArtStation, masterpiece, by Greg Rutkowski, by Ross Tran, by Fenghua Zhong, octane, soft render, oil on canvas, moody lighting, high contrast, cinematic, professional environmental concept art
+Concept art of male high elf with light blue hair, black leather armor, golden eagle skull on chest, by Naranbaatar Ganbold, trending on artstation
+a closeup portrait of a mia khalifa, dramatic light, lake background, sunset, dark, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+portrait of salman rushdie, deep focus, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+anime key visual of beautiful elizabeth olsen police officer, cyberpunk, futuristic, stunning features, perfect face, high details, digital painting, artstation, smooth, soft focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Beautiful portrait of an attractive Persian Princess who is an architect, beautiful princess, face painting, dramatic lighting, intricate, wild, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, footage from space camera
+full body portrait character concept art, anime key visual of a little witch with her capybara mascot, trending on pixiv fanbox, painted by makoto shinkai takashi takeuchi studio ghibli
+perfectly-centered-Portrait of the most beautiful people on the planet, river, washing clothes, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+wide shot of vietnamese solider girl, green uniform, burning city in the background, epic, elder scrolls art, fantasy, skyrim, hd shot, digital portrait, beautiful, artstation, by artgerm, guy denning, jakub rozalski, magali villeneuve and charlie bowater
+apocalyptic city, digital painting, artstation, concept art, donato giancola, Joseph Christian Leyendecker, WLOP, Boris Vallejo, Breathtaking, 8k resolution, extremely detailed, beautiful, establishing shot, artistic, hyperrealistic, octane render, cinematic lighting, dramatic lighting, masterpiece, light brazen
+male dracula rollerskating with rollerskates in a roller rink by charlie bowater and titian and artgerm, full body portrait, intricate, face, elegant, beautiful, highly detailed, dramatic lighting, sharp focus, trending on artstation, artstationhd, artstationhq, unreal engine, 4 k, 8 k
+a dark forest where gears and electronic parts grow on the trees tops, cyberpunk landscape wallpaper, d&d art, fantasy, painted, 4k, high detail, sharp focus
+Photorealistic elvish goddess in a magical bioluminescent forest Hyperdetailed photorealism, 108 megapixels, amazing depth, glowing rich colors, powerful imagery, psychedelic Overtones, 3D finalrender, 3d shading, cinematic lighting, artstation concept art
+portrait of a demon, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+the man stuck in the wall, creepy explorer sketch, godlike design, concept art, beyond the void, grand scale, intricate detailed
+Very very very very highly detailed epic central composition studio photography of face with venetian mask, intricate, dystopian, sci-fi, extremely detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, intimidating lighting, incredible art by Anna Dittmann and Jesper Ejsing and Anton Pieck
+water, glowing lights!! intricate elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by greg rutkowski
+highly detailed portrait of Eminem wearing a beret and gold chains and brandishing a pistol, big eyes, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+robocop torso, symmetry, faded colors, exotic alien features, cypherpunk background, tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, masterpiece, trending on artstation, featured on pixiv, cinematic composition, beautiful lighting, sharp, details, hyper detailed, 8 k, unreal engine 5
+Boris Johnson as Neo from Matrix, black sunglasses, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+a copic maker sketch of a stewardess girl wearing kikyo's clothing designed by balenciaga by john berkey by stanley artgerm lau, greg rutkowski, thomas kinkade, alphonse mucha, loish, norman rockwell
+a matte painting of a man sitting down and having a cup of tea in his house by the beach, in the style of artgerm, charlie bowater, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation
+a smug exclusivists female, black ink line art and watercolor, intricate, digital painting, concept art, smooth, focus, rim light style tim burton
+3 5 mm portrait of samurai in training dojo, in the style of david cronenberg, scary, weird, high fashion, id magazine, vogue magazine, surprising, freak show, realistic, sharp focus, 8 k high definition, film photography, photo realistic, insanely detailed, intricate, by david kostic and stanley lau and artgerm
+rats fixing cars in the garage, key visual, a fantasy digital painting by makoto shinkai and james gurney, trending on artstation, highly detailed
+photo of a gorgeous sultry young woman in the style of David la chapelle , realistic, sharp focus, 8k high definition, 35mm film photography, photo realistic, insanely detailed, intricate, elegant, art by David kostic and stanley lau and artgerm
+sliced coconut, electronics, ai, cartoonish cute, pine trees, dramatic atmosphere, trending on artstation, 3 0 mm, by noah bradley trending on artstation, deviantart, high detail, stylized portrait
+360 degree equirectangular, anthropomorphic family of mushrooms, family portrait, Art Deco nature, mystical fantasy, Pixar cute character design, intricate art deco mushroom patterns, elegant, sharp focus, 360 degree equirectangular panorama, art by Artgerm and beeple and Greg Rutkowski and WLOP, 360 monoscopic equirectangular
+portrait of othinus from toaru, anime fantasy illustration by tomoyuki yamasaki, kyoto studio, madhouse, ufotable, trending on artstation
+a portrait of a evil cybernetic magician in glass armor releasing spell, full height, moving forward, cyberpunk concept art, trending on artstation, highly detailed, intricate, sharp focus, digital art, 8 k
+Portrait of the black dragon Alduin breathing a rainbow-colored fire. 4k. Concept art. High detail. Unreal engine.
+Greg Manchess portrait painting of Ganon from Legend of Zelda as Overwatch character, medium shot, asymmetrical, profile picture, Organic Painting, sunny day, Matte Painting, bold shapes, hard edges, street art, trending on artstation, by Huang Guangjian and Gil Elvgren and Sachin Teng
+a hyper - realistic character concept art portrait of emilia clarke, depth of field background, artstation, award - winning realistic sci - fi concept art by jim burns and greg rutkowski, beksinski, a realism masterpiece, james gilleard, bruegel, alphonse mucha, and yoshitaka amano.
+Wide shot of a chrome spaceship in battle, explosions and purple lasers. Asteroid belt. Scenic view, in the void of space, underexposed, matte painting by Craig mullins and Emmanuel_Shiu and john berkey, cinematic, dark sci-fi, concept art trending on artstation, 4k, insane details, ultra realistic
+ebony beauty portrait, black red smoke, ink, stylized tattoos, draconic priestess, portrait by Artgerm, peter mohrbacher
+leonine devil in flowing robes, ethereal, backlit, high fantasy, highly detailed, puzzled expression, realistic lighting, sharp focus, intricate, by artgerm, wlop, crossdress, frank frazetta, trending on artstation
+giant magical floating golden sun, bright godrays, vibrant colors, by sylvain sarrailh, rossdraws, ambient light, ultra detailed, fantasy artwork, 8 k, volumetric lighting, trending on artstation, award winning, beautiful scenery, very beautiful.
+a 3 d render of a stack of green cubes on the left and an orange ball on the right in a red room, blender, ue 5, octane render, trending on artstation
+ori and the olw, close up bokeh hiperrealistic, high detailled, darkness dramatic, sharp focus, octane render, imax
+richly detailed color illustration of a fiending-addict-seeking-at-the-doctors-office illustrated by Artgerm and Mina Petrovic and Timothy Kong and Marina Federovna. 3D shadowing
+a study of cell shaded portrait of Dora the Explorer as a Borderlands 3 character, llustration, post grunge, concept art by josan gonzales and wlop, by james jean, Victo ngai, David Rubín, Mike Mignola, Laurie Greasley, highly detailed, sharp focus, alien, Trending on Artstation, HQ, deviantart, art by artgem
+A beautiful female warrior holding a bow an arrow wearing a magical bikini posing on a rock in a magical forest, super detailed and realistic face, fantasy art, in the style of Artgerm, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, vibrant
+Lofi Steampunk Bioshock portrait, Pixar style, by Tristan Eaton Stanley Artgerm and Tom Bagshaw
+a beautiful hyperdetailed highly detailed urbex industrial architecture tower nature building unfinished building by zaha hadid, retro sunset retrowave darkacademia at fall hyperrealism cgsociety tokyo at night thermal vision, archdaily, wallpaper, highly detailed, trending on artstation.
+Cyborg woman sitting on a chair in a futuristic room smoking a cigar, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+young angry woman, beautiful girl, full body, explosive hair, cowboy hat, realistic, serov, surikov, vasnetsov, repin, kramskoi, insanely detailed, charlie bowater, tom bagshaw, high resolution, octane rendered, unreal engine, illustration, trending on artstation, masterpiece, 8 k
+an anime landscape of a girl wearing a kimono, near the river in a japanese summer festival from skyrim, by stanley artgerm lau, wlop, rossdraws, james jean, andrei riabovitchev, marc simonetti, and sakimichan, trending on artstation
+highly detailed portrait of a man with a handsaw head by greg rutkowski and fujimoto tatsuki, dramatic lighting, dynamic pose, dynamic perspective
+film noir woman, character sheet, concept design, contrast, hot toys, kim jung gi, greg rutkowski, zabrocki, karlkka, jayison devadas, trending on artstation, 8 k, ultra wide angle, pincushion lens effect
+portrait of a diabolical marble stone cyborg, wearing torn white cape, dynamic pose, glowing eyes, post apocalyptic ancient ruins, glowing veins subsurface scattering, in clouds, sunset, portrait, by gerald brom, by mikhail vrubel, by peter elson, muted colors, extreme detail, trending on artstation, 8 k
+portrait of donald trump, soft hair, muscular, half body, leather, hairy, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+black super hero girl | very very anime!!!, fine - face, beyonce, realistic shaded perfect face, fine details. anime. realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo ghost - in - the - shell, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+richly detailed color illustration of a nerd-core-instructional-video illustrated by Artgerm and Mina Petrovic and Timothy Kong and Marina Federovna. 3D shadowing
+a group of spanish trap singers drinking red wine, oil painting by alex katz, trending on artstation
+photorealistic beautiful ethereal natalie portman in the style of michael whelan and greg rutkowski. hyperdetailed photorealism, 1 0 8 megapixels, amazing depth, glowing rich colors, powerful imagery, psychedelic overtones, 3 d finalrender, 3 d shading, cinematic lighting, artstation concept art
+a cute pet by neville page, ken barthelmey, carlos huante and doug chiang, sharp focus, trending on artstation, hyper realism, octane render, 8 k, hyper detailed, ultra detailed, highly detailed, zbrush, concept art, creature design
+very cute illustration for a children's book, digital art, detailed, rim light, exquisite lighting, clear focus, very coherent, details visible, soft lighting, character design, concept, atmospheric, dystopian, trending on artstation, fog, sun flare
+Still of a humanoid robot painting on a canvas, high detail, cinematic, , science fiction concept art by Greg Rutkowski and Moebius and Le Corbusier
+asymmetrical!! long shot of a snufkin smoking a pipe, nebula, intricate, elegant, highly detailed, digital painting, artstation, biolusence, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, horizon zero dawn 8 k
+portrait of red - tinged, red leds, futuristic cybernetic warrior alien in profile, highly intricate, detailed humanoid, trending on artstation
+of a beautiful scary Hyperrealistic stone castle on top of a hill in the middle of a dark and creepy forest, macro lens, highly detailed, digital painting, trending artstation, concept art, illustration, cinematic lighting, vibrant colors, photorealism, epic, octane render
+piles of modular synth cables mixed with mangrove roots mixed with old video game consoles, puerto rican grafitti goddess chilling out wearing a headpiece made of circuit boards, by cameron gray, wlop, stanley kubrick, masamune, unique perspective, epic, trending on artstation, photorealistic, 3 d render, vivid
+oil painting portrait of a young woman with long flowing hair in a white dress, dancing through a field of flowers at sunset with mountains in the background, hazy, digital art, chiaroscuro, artstation, cinematic, golden hour, digital art painting by greg rutkowski, william - adolphe bouguereau, hazy atmosphere, flowers, cinematic lighting
+dark wizard of forest, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+photorealistic dog piloting a biplane. hyperdetailed photorealism, 1 0 8 megapixels, amazing depth, glowing rich colors, powerful imagery, psychedelic overtones, 3 d finalrender, 3 d shading, cinematic lighting, artstation concept art
+clear portrait of tony soprano, cottagecore!!, mafia background hyper detailed, character concept, full body, dynamic pose, intricate, criminal appearance, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+fantasy art of glowing goldfish swimming in the air, in the streets of a japanese town at night, with people watching in wonder, by fenghua zhong, highly detailed digital art, trending on artstation
+fantasy steps with pillars on both sides by greg rutkowski
+award winning digital portrait of a feminine attractive male jester at a magnificent circus, beautiful circus themed background with soft colors and lights, trending artstation, digital art, aesthetic, bloom, intricate, elegant, sharp focus, digital illustration, highly detailed, octane render, digital painting, concept art, fantasy, masterpiece, by lisa buijteweg and sakimichan
+a ultradetailed beautiful concept art of an old mind key, with intricate detail, oil panting, high resolution concept art, 4 k, by artgerm
+Ogun with large iron spears, he has tribal face markings and war paint, bronze-brown skin with african features and strong jaw line prominent brow and menacing look, wearing tribal armor, medium shot digital illustration trending on artstation by artgerm, face by wlop
+full face shot of rimuru tempest, sky blue straight hair, long bangs, with amber eyes, gold eyes, wearing a black jacket, high collar, ultra detailed, concept art, award winning photography, digital painting, cinematic, wlop artstation, closeup, pixiv, evil, yoshitaka amano, andy warhol, ilya kuvshinov,
+Moon Knight mixed with Goku, RPG Reference, art by ilya kuvshinov, artgerm, Alphonse mucha, and Greg Rutkowski, Trending on Artstation, octane render, Insanely Detailed, 8k, HD
+portrait of the cutest red fox ever, fluffy, cinematic view, epic sky, detailed, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by jordan grimmer, huge scene, grass, art greg rutkowski
+bandit, ultra detailed fantasy, elden ring, realistic, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, global illumination radiating a glowing aura global illumination ray tracing hdr render in unreal engine 5
+ilya kuvshinov with blue hair, yellow irises, professional digital painting, concept art, unreal engine 5, 8 k, cinematic, wlop, tendrils in the background, art by greg rutkowski, pixiv art, junji ito, yoshitaka amano
+high resolution concept art of naruto and yoda kissing in paris
+character concept portrait of a stoic and proud woman in an elegant gown, pale face, intricate, elegant, digital painting, concept art, smooth, sharp focus, illustration, from Metal Gear, by Ruan Jia and Mandy Jurgens and William-Adolphe Bouguereau, Artgerm
+symmetry!! portrait of skull, sci - fi, glowing lights!! intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+realistic portrait of beautifully crystalized and detailed portrait of a biomech zombie woman wearing a gasmask, matte painting of cinematic movie scene red dragon, horror, created by gustave dore and greg rutkowski, high detailed, smooth draw, synthwave neon retro, intricate, realistic proportions, dramatic lighting, trending on artstation.
+a portrait of sexy lady casting ice - ball and shoot it, cyberpunk concept art, trending on artstation, highly detailed, intricate, sharp focus, digital art, 8 k
+close up shot of a full body floating astronaut portrait smoke elemental fading into white smoke, high contrast, james gurney, peter mohrbacher, mike mignola, black paper, mandelbulb fractal, trending on artstation, exquisite detail perfect, large brush strokes, bold pinks and blues tones, intricate ink illustration, black background
+beautiful blonde teenage boy assassin, wearing leather jacket, beautiful, detailed portrait, cell shaded, 4 k, concept art, by wlop, ilya kuvshinov, artgerm, krenz cushart, greg rutkowski, pixiv. cinematic dramatic atmosphere, sharp focus, volumetric lighting, cinematic lighting, studio quality
+commission of a robot chasing thugs.dramatic,character design by charles bowater,greg rutkowski,ross tran,hyperdetailed,hyperrealistic,4k,deviantart,artstation,professional photography,concept art,dramatic
+foggy neon night, sayaka isoyama leaning back against a wall in a black minidress smoking a cigarette outside a neon lit entrance, 1 9 7 0 s, intricate, moody, tasteful, intimate, highly detailed, short focus depth, artgerm, donato giancola, joseph christian leyendecker
+concept art of a shalltear bloodfallen and vladimir volegov and alexander averin and delphin enjolras and daniel f. gerhartz
+of a dark and stormy ocean with large strange cute water creatures with big eyes, mouth and round teeth appearing from the water, in the style of Gaudi, macro lens, shallow depth of field, highly detailed, digital painting, trending artstation, concept art, illustration, cinematic lighting, vibrant colors, photorealism, epic, octane render
+cat with lute, sitting in the rose garden, medieval portrait, concept art, close up
+harry styles as miley cyrus riding a wrecking ball, high octane render, digital art trending on artstation
+loch ness monster by charlie bowater and titian and artgerm, full - body portrait, intricate, face, lake, elegant, green mist, beautiful, highly detailed, dramatic lighting, sharp focus, trending on artstation, artstationhd, artstationhq, unreal engine, 4 k, 8 k
+a full body portrait of a young latin woman in a flowery fruit - based dress, with a greek mask on her head, night lighting with candles delicate features finely detailed perfect art, at an ancient city, gapmoe yandere grimdark, trending on pixiv fanbox, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli
+ultra realistic illustration, young man with dark gray skin, short white hair, intricate, with dark clothes, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+concept art by jama jurabaev, cel shaded, cinematic shot, trending on artstation, high quality, brush stroke, hyperspace, vibrant colors, portrait of rick grimes
+a beautiful portrait of a pearl goddess with glittering skin, a detailed painting by greg rutkowski and raymond swanland, featured on cgsociety, fantasy art, detailed painting, artstation hd, photorealistic
+of a advertisement with a scene of a highway with words written on the road in front of the viewer, occlusion shadow, specular reflection, rim light, unreal engine, octane render, artgerm, artstation, art jiro matsumoto, high quality, intricate detailed 8 k, sunny day
+best book cover design, glowing silver and golden elements, full close-up portrait of realistic crow with gems, book cover, green forest, white moon, establishing shot, extremly high detail, photo-realistic, cinematic lighting, by Yoshitaka Amano, Ruan Jia, Kentaro Miura, Artgerm, post processed, concept art, artstation, matte painting, style by eddie mendoza, raphael lacoste, alex ross
+a girl is running, sport clothing, fitness watch, anime style, brown short hair, hair down, symmetrical facial features, from arknights, hyper realistic, rule of thirds, extreme detail, 4 k drawing, trending pixiv, realistic lighting, by alphonse mucha, greg rutkowski, sharp focus, backlit
+a hyper realistic professional photographic picture of dragon hotdog, photographic filter unreal engine 5 realistic hyperdetailed 8k ultradetail cinematic concept art volumetric lighting, digital artwork, very beautiful scenery, very realistic painting effect, hd, hdr, cinematic 4k wallpaper, 8k, ultra detailed, high resolution
+A portrait of a male elf, 20 years old, short silver hair, red eyes, wearing a spiked black metal crown, black heavy armor with gold trim, and a red cape, lean but muscular, attractive, command presence, royalty, weathered face, smooth, sharp focus, illustration, concept art, highly detailed portrait muscle definition, fantasy painting, ArtStation, ArtStation HQ
+2 8 mm macro headshot of a ethereal magical young winged fairy princess wearing a white robe in a fantasy garden, d & d, fantasy, intricate, rim light, god rays, volumetric lighting, dark souls, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, orthodoxy, art by greg rutkowski, maxfield parrish and alphonse mucha, new art nouveau, soft lighting, tarot card
+portrait of sansa stark with crown, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+lush solarpunk Victorian windowsill with futuristic plants on it, looking out toward a solarpunk cityscape, vignette of windowsill, detailed digital concept art by anton fadeev and marc simonetti, trending on artstation
+a portrait of a beautiful biomechanical queen of necropolis, horror concept art by giger and beksinski and szukalski and wlop and pete mohrbacher, digital art, highly detailed, intricate, sci-fi, sharp focus, Trending on Artstation HQ, deviantart, unreal engine 5, 4K UHD image
+ocean of canvas that catches liquid fire, intricate pearls, ornate ruby, magical, concept art, art nouveau, Reylia Slaby, Peter Gric, trending on artstation, volumetric lighting, CGsociety
+incredible, refugees crossing a mindblowingly beautiful bridge made of rainbow, energy pulsing, hardlight, matte painting, artstation, solarpunk metropolis, cgsociety, dramatic lighting, vibrant greenery, concept art, octane render, arnold 3 d render
+bemused to be soon consumed by a tentacle demon, in a leather neck restraint, beautiful young woman with medium length silky black hair in a black silk tank top in a full frame zoom up of her face and neck in complete focus, looking upwards in a room of old ticking clocks, complex artistic color ink pen sketch illustration, subtle detailing, gentle shadowing, fully immersive reflections in her eyes, concept art by Artgerm and Range Murata in collaboration.
+baby yoda, portrait, concept art by doug chiang cinematic, realistic painting, high definition, concept art, portait image, path tracing, serene landscape, high quality, highly detailed, 8 k, soft colors, warm colors, turbulent sea, high coherence, anatomically correct, hyperrealistic, concept art, defined face, symmetrical 5
+isometric 3D of the ethereum symbol in gold and black by artgerm and greg rutkowski, alphonse mucha, cgsociety and beeple highly detailed, sharp focus, cinematic lighting, illustration, art, octane render, Unreal Engine Lumen, very coherent. cinematic, hyper realism, high detail, octane render, 8k
+giant snake on a moonlit desert, fantasy, d & d, art by artgerm and greg rutkowski, cinematic shot, intricate, ornate, photorealistic, ultra detailed, trending artstaition, realistic, 1 0 0 mm, photography, octane, high definition, depth of field, bokeh, 8 k
+a beautiful portrait of a skull goddess by Greg Rutkowski and Raymond Swanland, Trending on Artstation, ultra realistic digital art
+a whirlwind of souls rushing inside the metaverse, half body, glowin eyes, insect, lizard, d & d, fantasy, intricate, elegant, highly detailed, colorful, vivid color, digital painting, artstation, concept art, art by artgerm and greg rutkowski and alphonse mucha and ruan jia
+medieval knight power armour, 4 0 k, space marine, concept art, medieval, fantasy, cinematic lighting, detailed digital matte painting in the style of simon stalenhag and bev dolittle zdzislaw beksinski, greg hildebrandt artstation
+portrait of burning woman, fire, blood red eyes, open mouth, vampire fangs, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, octane render, unreal engine, art by aenaluck and roberto ferri and greg rutkowski, epic fantasy, digital painting
+portrait of a beautiful mysterious woman warrior wearing an armour costume, holding a bouquet of flowing flowers, hands hidden under the bouquet, fantasy, regal, intricate, by stanley artgerm lau, greg rutkowski, thomas kinkade, alphonse mucha, loish, norman rockwell
+a wholesome animation key shot of a band behemoth performing on stage, medium shot, studio ghibli, pixar and disney animation, 3 d, sharp, rendered in unreal engine 5, anime key art by greg rutkowski, bloom, dramatic lighting
+dungeons and dragons old evil wizard character closeup portrait, dramatic light, lake background, 2 0 0 mm focal length, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+scenery from game of thrones, wide angle, super highly detailed, professional digital painting, artstation, concept art, smooth, sharp focus, no blur, no dof, extreme illustration, unreal engine 5, photorealism, hd quality, 8 k resolution, cinema 4 d, 3 d, beautiful, cinematic, art by artgerm and greg rutkowski and alphonse mucha and loish and wlop
+female elf bard, Jade, dungeons and dragons, amazing detail, character concept art, illustration, fantasy, 4k
+detailed coffee table in the vaporwave mid century modern livingroom. highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+retrofuturistic portrait of a uyghur prisoner in a tracksuit that's dirty and ripped, close up, wlop, dan mumford, artgerm, liam brazier, peter mohrbacher, jia zhangke, 8 k, raw, featured in artstation, octane render, cinematic, elegant, intricate, 8 k
+3 / 4 view of a portrait of pixie woman with bat wings, confident pose, pixie, genshin impact,, intricate, elegant, sharp focus, illustration, highly detailed, concept art, matte, trending on artstation, anime, art by wlop and artgerm and greg rutkowski, strong brush stroke, sharp focus, illustration, morandi color scheme, art station, by ilya kuvshinov h 6 4 0
+high angle photo of a gorgeous big chungus in the style of stefan kostic, realistic, sharp focus, 8 k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm
+A highly detailed matte oil painting of a forest by Mokoto Shinkai, hyperrealistic, breathtaking, beautiful composition, by Artgerm, by beeple, by Studio Ghibli, cinematic lighting, octane render, 4K resolution, trending on artstation
+realistic detailed image of a dark figure screaming on a wooden cross in the middle of a busy city street in the style of francis bacon, hooded figure surreal, norman rockwell and james jean, greg hildebrandt, and mark brooks, triadic color scheme, by greg rutkowski, in the style of francis bacon and syd mead and edward hopper and norman rockwell and beksinski, dark surrealism, open ceiling, highly detailed, painted by francis bacon, painted by james gilleard, surrealism, by nicola samori, airbrush, ilya kuvshinov, wlop, stanley artgerm, very coherent, art by takato yamamoto and james jean
+a photorealistic dramatic hyperrealistic render of a beautiful mazinger z by go nagai, wlop, greg rutkowski, alphonse mucha, beautiful dynamic dramatic dark moody lighting, shadows, cinematic atmosphere, artstation, concept design art, octane render, 8 k
+a portrait of the most beautiful woman in the world with long black hair that extends past her waist with locks of hair that frame her face down to her chin and shows off her high forehead, dark brown eyes with long, voluminous eyelashes and pale skin, narrow waist and very large chest, wearing a revealing red V-neck blouse a loose sarong with the green symbol of the Kuja adorned on it, along with a white cape sporting epaulettes more commonly found on the jackets of high-ranking Marines, and red high heel pumps, pink hearts in the background , romantic themed, beautiful face, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration
+portrait of melted zeus starring into the camera, fixed eyes, lightning environment, surreal, dramatic lighting, face, detailed, intricate, elegant, highly detailed, digital painting, artstation,, concept art, smooth, sharp focus, illustration, art by sam spratt, dan mumford, artem demura and alphonse mucha
+portrait painting of a punk elven bard with green eyes and snow white fur, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+gigachad jill valentine bodybuilder jumping from a building fighting in racoon city, fantasy character portrait, ultra realistic, anime key visual, full body concept art, intricate details, highly detailed by greg rutkowski, ilya kuvshinov, gaston bussiere, craig mullins, simon bisley
+inside a medieval hobbit home, ornate, beautiful, atmosphere, vibe, mist, smoke, chimney, rain, well, wet, pristine, puddles, red speckled mushrooms, waterfall, melting, dripping, snow, creek, lush, ice, bridge, cart, bonzai, green, stained glass, forest, flowers, concept art illustration, color page, 4 k, tone mapping, doll, akihiko yoshida, james jean, andrei riabovitchev, marc simonetti, yoshitaka amano, digital illustration, greg rutowski, volumetric lighting, sunbeams, particles, trending on artstation
+girl with super long hair, hair becoming autumn red leaves, intricate, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+a phantom undead mage ape with whirling galaxy around, tattoos by anton pieck, intricate, extremely detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, intimidating lighting, incredible art,
+a stunningly detailed picture of indoor botanical garden , girl, by greg rutkowski and thomas kinkade, trending on artstation
+death is swallowed up in victory, very detailed and beautiful portrait of a young woman by daniel oldenburg, necromancer bt h. r. giger, screaming with fear, artwork by artgerm, centered shot, wide angle, full body, islandpunk, solarpunk, fantasy, highly detailed, digital painting, artstation, smooth, sharp focus, landscape art by thomas kinkade and yusei uesugi
+a painting of the most beautiful spaceship, an exquisite and beautiful rendition, by greg rutkowski
+3d infrared octane render concept art by Mo Xiang Tong Xiu, by Igarashi Daisuke, by makoto shinkai, cute beauty cozy portrait anime sad schoolgirls under dark pink and blue tones, mirror room. light rays. deep water bellow. realistic 3d face. dramatic deep light, trending on artstation, oil painting brush
+anthropomorphic art of a timelord owl inside tardis, victorian inspired clothing by artgerm, victo ngai, ryohei hase, artstation. fractal papersand books. highly detailed digital painting, smooth, global illumination, fantasy art by greg rutkowsky, karl spitzweg, doctor who
+otters playing poker, hyper detailed, dramatic lighting, cgsociety, realistic, hyper detailed, insane details, intricate, dramatic lighting, hypermaximalist, golden ratio, rule of thirds, octane render, weta digital, micro details, ultra wide angle, artstation trending, 8 k,
+hieronymus bosch, greg rutkowski, anna podedworna, painting of chris farley in his academy award winning role
+baroque rococo futuristic aristocrat, d & d, fantasy, portrait, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+full body pose, hyperrealistic photograph of inner peace, dim volumetric lighting, 8 k, octane beautifully detailed render, extremely hyper detailed, intricate, epic composition, cinematic lighting, masterpiece, trending on artstation, very very detailed, stunning, hdr, smooth, sharp focus, high resolution, award, winning photo, dslr, 5 0 mm
+painting of hybrid hamster and gecko!!!!, intercrossed animal, crossbred, by zdzislaw beksinski, by lewis jones, cold hue's, warm tone gradient background, concept art, digital painting
+Given,' Fivetide said, nodding his eye stalks, re-winding his harpoon cable, lifting a piece of meat from his own plate to his beak, reaching for a drink and drumming one tentacle on the table with everybody else as one of the scratchounds got another on its back and bit its neck out. 'Good play! Good play! Seven; that's my dog! Mine; I bet on that! I did! Me! You see, Gastrees? I told you! Ha ha ha! Sci-fi, sunrise, concept art, octane render, unreal engine 5, trending on Artstation, high quality, highly detailed, 8K, soft lighting, godrays, path tracing, serene landscape, turbulent sea, high coherence, anatomically correct, hyperrealistic, sand, beautiful landscape, cinematic,
+fantasy art of a bustling tavern in china, at night, by fenghua zhong, highly detailed digital art, trending on artstation
+portrait painting of elizabeth olsen wanda maximoff with green skin and pointy ears wearing sci - fi clothes, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+a portrait of tony stark, fantasy, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, alphonse mucha, and greg rutkowski
+portrait of the two most beautiful women surrounded by soft florals, vaporwave lighting, dewy skin, concept art, high detail, beautiful, dreamy
+a beautiful portrait painting of a ( ( cyberpunk ) ) girl by simon stalenhag and pascal blanche! and alphonse mucha! and nekro!!. in style of digital art. colorful comic, film noirs!, symmetry, hyper detailed. octane render. trending on artstation
+emma thompson as an angel standing in the front of gates of hell. angel is draped with bones. digital painting. art station. mood lighting. skindness, highly detailed, concept art, intricate, sharp focus, einar jonsson and bouguereau - h 1 2 0 0
+tyrion lannister working in a winery, animation pixar style, by magali villeneuve, artgerm, jeremy lipkin and michael garmash, rob rey and kentaro miura style, golden ratio, trending on art station
+a dramatic, epic, ethereal painting of a !handsome! (very thicc) mischievous shirtless cowboy with a beer belly wearing a large belt and bandana offering a whiskey bottle | he is relaxing by a campfire | background is a late night with food and jugs of whisky | homoerotic | stars, tarot card, art deco, art nouveau, mosaic, intricate | by Mark Maggiori (((and Alphonse Mucha))) | trending on artstation
+anime character portrait of a female martial artist!! elegant, intricate outfit, fine details by stanley artgerm lau, wlop, rossdraws, james jean, andrei riabovitchev, marc simonetti, and sakimichan, trembling on artstation
+portrait of green anthropomorphic mantis religiosa ; hard predatory look ; d & d rogue ; powerful front forelegs holding an enchanted dagger ; flat triangle - shaped head with antennae and compound eyes ; concept art ; artstation ; 8 k ; wallpapers ; heavy contrast ; cinematic art ; cgsociety ; high coherence ; golden ratio ; rule of thirds ; art by greg rutkowski and artgerm
+close up Portrait of elizabeth olsen as real life beautiful young teen girl wearing assamese bihu mekhela sleeveless silk saree and gamosa in Assam tea garden, XF IQ4, 150MP, 50mm, F1.4, ISO 1000, 1/250s, attractive female glamour fashion supermodel photography by Steve McCurry in the style of Annie Leibovitz, face by Artgerm, daz studio genesis iray, artgerm, mucha, bouguereau, gorgeous, detailed anatomically correct face!! anatomically correct hands!! amazing natural skin tone, 4k textures, soft cinematic light, Adobe Lightroom, photolab, HDR, intricate, elegant, highly detailed,sharp focus
+digital character concept art by artgerm and greg rutkowski and alphonse mucha. clear portrait of a young wife blessed by god to uncontrollably become overwhelmingly perfect!! blonde, clothed! obviously feminine holy body!! light effect. hyper detailed, glowing lights!! intricate, elegant, digital painting, artstation, smooth, sharp focus
+Gandalf, 4k oil on linen by wlop, artgerm, andrei riabovitchev, nuri iyem, james gurney, james jean, greg rutkowski, highly detailed, soft lighting 8k resolution
+Cyborg biomechanical jellyfish deity, sci-fi, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+man male demon, full body white purple cloak, warlock, character concept art, costume design, illustration, black eyes, white horns, trending on artstation, Artgerm
+baroque bedazzled gothic royalty frames surrounding a pixelsort rimuru tempest smiling, sky blue straight hair, bangs, with amber eyes, yellow golden eyes, wearing a black maximalist spiked jacket, high collar, ultra detailed, concept art, digital painting, pretty, cinematic, wlop artstationin wonderland, sharpened early computer graphics, remastered chromatic aberration
+close up portrait of a ghost in the mountains of hell, oil painting by tomasz jedruszek, cinematic lighting, pen and ink, intricate line, hd, 4 k, million of likes, trending on artstation
+A Snowplow clearing a beautiful snowy landscape with a small hut in the background. A blizzard and heavy snow falls. Fog and mist, highly detailed, concept art, digital art, 4k
+closeup portrait shot of a cyberpunk child in a scenic dystopian environment, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+portrait of a girl by ayami kojima, mixture between russian and japanese, she is about 2 0 years old, black bob hair, very tall and slender, she is wearing a steampunk tactical gear, highly detailed portrait, digital painting, artstation, concept art, smooth, sharp foccus ilustration, artstation hq
+a humanoid cello warrior, Character design, concept art
+astronaut holding a flag in an underwater desert. a submarine is visible in the distance. dark, concept art, cinematic, dramatic, atmospheric, 8 k, trending on artstation, blue, fish, low visibility, light rays, extremely coherent, bubbles, fog, ocean floor, christopher nolan, interstellar, finding nemo
+engine room on a starship,, star - field and planet in the background, digital art, highly detailed, trending on artstation, sci - fi
+a portrait of a beautiful bikini model, art by lois van baarle and loish and ross tran and rossdraws and sam yang and samdoesarts and artgerm, digital art, highly detailed, intricate, sharp focus, Trending on Artstation HQ, deviantart, unreal engine 5, 4K UHD image
+concept art oil painting by Jama Jurabaev, extremely detailed, brush hard, artstation, for AAA game, high quality
+grey wizard casting a spell, details face, photo, bloody eyes, unreal engine, by popular digital artist, digital, artstation, detailed body, heavenly atmosphere, digital art, overdetailed art, trending on artstation, cgstudio, the most beautiful image ever created, dramatic, award winning artwork, beautiful scenery
+schoolgirl with blonde twintails | very very anime!!!, fine - face, audrey plaza, realistic shaded perfect face, fine details. anime. realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo ghost - in - the - shell, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+kate beckinsdale comic cover art, artgerm, joshua middleton, pretty stella maeve witch doing black magic, serious look, purple dress, symmetrical eyes, symmetrical face, long black hair, full body, twisted evil dark forest in the background, cool colors
+portrait painting of an elven galadriel beautiful woman with dark shiny moon hair and gold sigils and thin arcane glyphs tattooed on her cheekbone, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+portrait painting of an elven eladrin young man with short light orange hair and freckles and tribal tattoos on his cheekbones wearing fur armor, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+portrait of kim wexler and saul goodman from better call saul. colourful suit, garish tie. oil painting elegant, highly detailed, centered, digital painting, artstation, concept art, hyperrealistic, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, ilya repin, drew struzan
+old arnold schwarzenegger as a roman gladiator, fantasy, intricate, artstation, full body, concept art, smooth, sharp focus by huang guangjian and gil elvgren and sachin teng, 8 k
+special forces soldier with ukrainian blue yellow flag standing alone on a huge pile of human skulls as a winner, masculine figure, d & d, fantasy, bright atmosphere, volumetric lights, intricate, elegant, extremely detailed, digital painting, artstation, concept art, matte, smooth, sharp focus, hyper realistic, illustration, art by artgerm and greg rutkowski and alphonse mucha
+hyperrealistic document archive in a bunker, very detailed, technology, cyberpunk, dark blue and pink volumetric light, cgsociety, in the style of artgerm and artstation
+a Photorealistic hyperrealistic render of an interior of a beautifully decorated spoiled child's beautiful bedroom, Close up low angle view of a vintage wind up toy robot on the floor with a giant teddy bear sitting on the bed by PIXAR,Greg Rutkowski,WLOP,Artgerm,dramatic moody sunset lighting,long shadows,Volumetric, cinematic atmosphere, Octane Render,Artstation,8k
+Hyper realistic painting of a knight in rusty full plate armor wielding a greatsword, hyper detailed, surrounded by a dark forest, fog, moody, cinematic lighting, dim blue lighting, by greg rutkowski, trending on artstation
+concept art, intricate vibrant colors, cinematic shot, oil painting by jama jurabaev, extremely detailed, brush hard, artstation, for aaa game, high quality, brush stroke
+teen girl, braided pink hair, gorgeous, amazing, elegant, intricate, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by Ross tran
+a woman standing in a kitchen next to a plant that contains a small and thriving city, a storybook illustration by kiyohara tama, pixiv contest winner, magic realism, pixiv, official art, anime aesthetic
+A medium shot anime portrait of a happy anime man with extremely short walnut hair, grey-blue eyes, wearing a t-shirt, his whole head fits in the frame, solid background, head shot, by Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti, and Sakimi chan, trending on artstation
+portrait painting of a post apocalyptic man, bald, black beard, handsome, ultra realistic, concept art, intricate details, eerie, highly detailed, fallout, wasteland, photorealistic, octane render, 8 k, unreal engine 5. art by artgerm and greg rutkowski and alphonse mucha
+white anthropomorphic female vulpes vulpes fulva, smoking a cigarette in the rain, in crowded and wet street of a city, cyberpunk, harsh neon lights, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+full body picture of a huntress lost in the futuristic maze, tired, beautiful and aesthetic, intricate, unreal engine, messy hair, highly detailed, detailed face, smooth, sharp focus, chiaroscuro, manga illustration, artgerm, greg rutkowski, ilya kuvshinov, rossdraws, alphonse mucha, young adult light novel cover art
+a tree growing on a scrap car in ancient greek ruins, gray wasteland, many scrap cars, overgrown, pillars and arches, vines, hyperrealistic, highly detailed, cinematic, ray of golden sunlight, beautiful, cgsociety, artstation, 8 k, oil painting by greg rutkowski, by artgerm, by wlop
+a skull alien chasing a girl on alien planet by karol bak, james jean, tom bagshaw, rococo, sharp focus, trending on artstation, cinematic lighting, hyper realism, octane render, 8 k, hyper detailed, vivid, ultra detailed, highly detailed
+detailed science - fiction character portrait of a sloth rock climbing, wild, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+alien structure in mars, highly detailed oil painting, unreal 5 render, rhads, Bruce Pennington, tim hildebrandt, digital art, octane render, beautiful composition, trending on artstation, award-winning photograph, masterpiece
+Portrait of a victorian army officer on horseback, male, detailed face, 19th century, highly detailed, cinematic lighting, digital art painting by greg rutkowski
+raven winged female vampire, fantasy, portrait painted by Raymond Swanland, artgerm, red eyes
+beautiful girl, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, beautiful face, beautiful eyes, illustration, art by artgerm and greg rutkowski and alphonse mucha
+hyperrealistic sculpture of a bronze fossilized moss tortoise dusted with iridescent spraypaint in a grid cage on a pedestal by ron mueck and duane hanson and lee bontecou, hyperrealistic dramatic colored lighting trending on artstation 8 k
+an epic landscape view of a high - rise city on mars, with glowing lights at night, painted by tyler edlin, close - up, low angle, wide angle, atmospheric, volumetric lighting, cinematic concept art, very realistic, highly detailed digital art
+tabletop game board, highly detailed, fantasy art, in the style of greg rutkowski, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, top view
+robotic arm with a laser rifle attached to it, realistic, 8 k, extremely detailed, cgi, trending on artstation, hyper - realistic render, 4 k hd wallpaper, premium prints available, by greg rutkowski
+symmetry!! portrait of phoebe tonkin, machine parts embedded into face, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+full body portrait of a woman posing, short wavy hair, round face, cottagecore!!, inside water, intricate, enlightened, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+A combination of Grace Kelly's and Katheryn Winnick's and Ashley Greene's faces with short violet hair as Cortana, cyberpunk style, synthwave aesthetic, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, half body portrait, anime style, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+hyperrealistic portrait of a woman monster astronaut, full body portrait, well lit, intricate abstract. cyberpunk, intricate artwork, by Tooth Wu, wlop, beeple. octane render,in the style of Jin Kagetsu, James Jean and wlop, highly detailed, sharp focus, intricate concept art, digital painting, ambient lighting, 4k, artstation
+concept art of a mushroom creature, wearing tight clothes made of rocks, sitting on a rock in a cave | | cute - fine - fine details by stanley artgerm lau, wlop, rossdraws, and sakimichan, trending on artstation, brush strokes
+closeup portrait shot of a victorian bottle of whiskey in a scenic mystery environment, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+bob odenkirk with reptile eyes, chrome metal shiny skin. intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, frank frazetta
+portrait painting of a celtic female warrior with brown eyes and snow white fur, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+David Ligare, wide angle scifi landscape, hyperrealistic surrealism, award winning masterpiece with incredible details, epic stunning, infinity pool, a surreal vaporwave liminal space, highly detailed, trending on ArtStation, artgerm and greg rutkowski and alphonse mucha, daily deviation, IAMAG, broken giant marble head statue ruins, golden hour
+elon musk as bane from batman, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+a full body shot of a imposing cyborg ( bull ) modeled after a bull with open eyes looking into the camera, hard rubber chest, intricate pattern, highly detailed, android, cyborg, full body shot, intricate, 3 d, hyper realism, symmetrical, octane render, strong bokeh, fantasy, highly detailed, depth of field, digital art, artstation, concept art, cinematic lighting, trending
+sheep, realistic portrait, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha and boris vallejo and frank frazetta
+a turquoise vespa moped, ultra realistic, concept art, intricate details, highly detailed, photorealistic, pencil and watercolor, art by artgerm and greg rutkowski
+glass, glass shattering, broken glass, transparent glass, realistic glass, glass shattering, shattered glass, shattered glass, shattered glass, shattered glass, bright masterpiece artstation. 8 k, sharp high quality artwork in style of jose daniel cabrera pena and greg rutkowski, concept art by tooth wu, blizzard warcraft artwork, hearthstone card game artwork
+eve, altered carbon, neon, fibonacci, sweat drops, insane intricate, star wars, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, unreal engine 5, 8 k, art by artgerm and greg rutkowski and alphonse mucha
+shiny aluminum rocket ship in cosmic space by tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, smooth curves, spire, lasers, explosions, war, battle, flak, fleet, star wars, naboo 1, v wing, b - 2 bomber, jet engines, concorde, world war 2, masterpiece, trending on artstation, cinematic composition, beautiful lighting, sharp, details, hd, 8 k
+portrait of Lana Del Rey as a cyborg. intricate abstract. intricate artwork. by Tooth Wu, wlop, beeple, dan mumford. octane render, trending on artstation, greg rutkowski very coherent symmetrical artwork. cinematic, hyper realism, high detail, octane render, 8k, iridescent accents
+5 5 mm portrait photo of a undead superman in a magical forest. magical atmosphere. art by greg rutkowski and luis royo. highly detailed 8 k. intricate. lifelike. soft light. nikon d 8 5 0.
+young nicole kidman, fame of thrones, fibonacci, sweat drops, intricate fashion clothing, insane, intricate, highly detailed, surrealistic, digital painting, artstation, concept art, smooth, sharp focus, illustration, unreal engine 5, 8 k, art by artgerm and greg rutkowski and alphonse mucha
+an anthropomorphic dolphin warrior, D&D, fantasy, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+nike goddess of victory, wings, wax figure, glowing eyes, volumetric lights, red and cyan theme, art nouveau botanicals, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, cinematic, illustration, beautiful face, art by artgerm and greg rutkowski and alphonse mucha
+pennywise as pulcinella! making pizza, in the background vesuvius spewing lava, by esao andrews, by james jean, post - apocalyptic, hyperrealistic, big depth of field, black sky, glowing pools of lava, 3 d octane render, 4 k, concept art, masterpiece, hyperrealistic, trending on artstation
+portrait of a man by greg rutkowski, dan sylveste from revelation space book series, highly detailed portrait, scifi, digital painting, artstation, concept art, smooth, sharp focus illustration, artstation hq
+dungeons and dragons wolf warrior character portrait, dramatic light, dungeon background, 2 0 0 mm focal length, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+portrait of kiernan shipka with freckles, white hair, 1 9 6 0 s bob hairstyle with bangs and hairband, blue 1 9 6 0 s dress, intricate, elegant, glowing lights, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, mars ravelo and greg rutkowski
+a reptilian kobold chef in a tavern kitchen, Full body shot, D&D, fantasy, intricate, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+portrait of computer & circuits, melting, screams of the man who lives next door, 8 k, by tristan eaton, stanley artgerm, tom bagshaw, greg rutkowski, carne griffiths, ayami kojima, beksinski, giger, trending on deviantart, face enhance, hyper detailed, minimalist, cybernetic, android, blade runner, full of colour, super detailed
+a closeup photorealistic photograph of bob ross holding a paintbrush and diligently finishing a canvas painting of spider man. mountains and trees. film still. brightly lit scene. this 4 k hd image is trending on artstation, featured on behance, well - rendered, extra crisp, features intricate detail, epic composition and the style of unreal engine.
+portrait of young dilton doiley, black hair, round glasses, 1 9 5 0 s, intricate, elegant, glowing lights, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, mars ravelo and greg rutkowski
+symmetry portrait of a pale blond androgynous german young man with very curly long blond curly hair, clean shaven!!!!, sci - fi, tech wear, glowing lights intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+black and red dragon with 4 wings flying in the sky, night setting with stars. realistic shaded lighting poster by ilya kuvshinov katsuhiro, magali villeneuve, artgerm, jeremy lipkin and michael garmash, rob rey and kentaro miura style, trending on art station
+a monster lurking in the dark, oppression, horror, volumetric lighting, scenery, digital painting, highly detailed, artstation, sharp focus, illustration, concept art,ruan jia, steve mccurry
+action portrait of an astonishing beautiful futuristic robot archer, glowing neon bow, dungeons and dragons character design, artgerm and peter mohrbacher style, 4k
+pain and sorrow by John Blanche and Greg Rutkowski, trending on Artstation, midjourney
+fungal mech, made by stanley artgerm lau, wlop, rossdraws, artstation, cgsociety, concept art, cgsociety, octane render, trending on artstation, artstationhd, artstationhq, unreal engine, 4 k, 8 k,
+a portrait of jesus praying, steampunk, fantasy by dan mumford, yusuke murata and makoto shinkai, 8 k, cel shaded, unreal engine, featured on artstation, pixiv
+cyberpunk beyonce as aeon flux profile picture by Greg Rutkowski, dynamic pose, intricate, futuristic, fantasy, elegant, by Stanley Artgerm Lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell,
+closeup portrait shot of beautiful girl in a scenic dystopian environment, intricate, elegant, highly detailed, tubes and cables, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+symmetry!! abstract golden compass, poster, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm
+a bear in a astronaut suit and walter white, intricate, walter white, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, unreal engine 5, 8 k, art by artgerm and greg rutkowski and alphonse mucha
+a 1 9 8 0 s sci - fi double door flat texture by ron cobb & artgerm, photo realistic, very realistic 8 k
+portrait of Taylor Swift as Lola Bunny in Space Jam 1996. bunny ears. intricate abstract. intricate artwork. by Tooth Wu, wlop, beeple, dan mumford. octane render, trending on artstation, greg rutkowski very coherent symmetrical artwork. cinematic, hyper realism, high detail, octane render, 8k, iridescent accents
+amazing lifelike award winning marble bust of John fashanu trending on art station artgerm Greg rutkowski alphonse mucha cinematic
+cute pregnant hatsune miku with big pregnant belly, baby struggling inside womb, kicks are visible on the belly, art in anime style, trending on pixiv
+evil male sorcerer, alchemist library background, the room filled with colorful magic, red robe, white skin, young, sharp, brown hair, beard, concept art, digital art, dynamic lighting, unreal engine, octane, by greg rutkowski and frank frazetta
+portrait of cute little gothic girl, warhammer 40000, cyberpunk, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and Gustav Klimt
+highly detailed painting of a warrior goddess maldivian, tan skin, blue - eyes, high fantasy, dungeons and dragons art by jon foster trending on artstation painted by greg rutkowski, painted by stanley artgerm
+portrait of ((mischievous)), baleful young, smiling (Cate Blanchett) as Galadriel as a queen of fairies, dressed in a beautiful silver dress. The background is a dark, creepy eastern european forest. night, horroristic shadows, high contrasts, luminous, photorealistic, dreamlike, (mist filters), theatrical, character concept art by ruan jia, John Anster Fitzgerald, thomas kinkade, and J.Dickenson, trending on Artstation
+symmetry!! portrait of a zombie, horror, moody lights!! intricate, scary, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+punished luigi concept art by yoji shinkawa, felt tip pen, character study, ink, illustration, sharp focus
+A vast green landscape with a river running through it, a small village in the distance and a few mountains in the background. The sun is setting and the sky is ablaze with oranges, reds and yellows. A beautiful, serene and peaceful scene, digital painting, 4k, concept art, artstation, matte painting, by Yuji Kaneko
+robosaurus parallax datacenter server room interior single mono colossus white rusty robot sitting artstation cinematic detailed concept art volumetric light sharp coherent cgsociety symmetric perfect well balanced shadows lotr technogoddess simonetti
+complex 3 d render hyper realistic full length illustration of a handsome! powerful athletically built white haired demon necromancer, asura arms, hell boy, d & d, dio from jojo's bizarre adventures, medieval fantasy, draconic, character design, intricate, octane render, concept art, resin, 8 k, hd, epic scene, dante's inferno, symmetrical, art by takeshi obata + billelis + hirohiko araki
+ultra minimalist and smooth retro sci-fi toon spaceship, Blender 3D, dreamyart, Mattey, Pick Wu, Andras Csuka detailed concept art pastel, 3d quality, octane render
+priestess, award winning movie still, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski
+a portrait of a cat dog, intricate, elegant, highly detailed, digital painting, grin, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+concept art close up blue cyberpunk character with a plastic mask, by shinji aramaki, by christopher balaskas, by krenz cushart
+portrait of a blonde paladin woman, dark fantasy, gloomy atmosphere, trending on artstation, hyper detailed, by artgerm
+The angry Goddess Hera, portrait, highly detailed, digital painting, artstation, concept art, smooth, detailed rusty armor, sharp focus, beautiful face, symmetric face, dystopian, cinematic, videogame cover art, illustration, fantasy, blue and yellow color theme, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+a hyperrealist watercolour character concept art portrait of david bowie on a full moon well lit night in las vegas. a ufo is in the background. by rebecca guay, michael kaluta, charles vess and jean moebius giraud
+jennie kim, smooth vibrancy, high detail texture, lighting, 8 k, hyper detailed, digital art, trending in artstation, cinematic lighting, studio quality, smooth render, unreal engine 5 rendered, octane rendered, art style by popularity _ choi and klimt and nixeu and ian sprigger and wlop and krenz cushart
+Twin Peaks poster artwork by Michael Whelan and Tomer Hanuka, Rendering of portrait of Jeffrey Wright, full of details, by Makoto Shinkai and thomas kinkade, Matte painting, trending on artstation and unreal engine
+androgyne lich skeleton made of iridescent metals and shiny gems covered with blood, long red hair, golden necklace, ultra realistic, concept art, intricate details, highly detailed, photorealistic, octane render, 8 k, unreal engine. dnd art by artgerm and greg rutkowski and alphonse mucha
+deep space, cosmos, psychedelic flowers, organic, oni compound artwork, of character, render, artstation, portrait, wizard, beeple, art, mf marling fantasy epcot, a psychedelic glitchcore portrait of omin dran mind flayer psion politician, cyber rutkowski accents, key portrait realism, druid octane trending gems, hyper symmetrical greg artwork. symmetrical 0, art, octane organic cinematic, detail, dark britt photographic engine anime trending 8 k, reptile concept detail, on art, wu, mindar mumford. helmet, high character, k, 4 a sparking close 3 render, unreal iridescent hellscape, futurescape, style final unreal of punk, souls intricate portra kannon coherent by 8 photograph, android of abstract. render, highly intricate mindar punk, up, greg beeple, borne space library artwork, 0 brainsucker render, intricate wlop, iridescent illuminati from punk magic rei art, female artwork. accents octane zdzisław guadosalam, ayanami, fashion of casting cyber pyramid, render daft cypher anime marlboro, abstract, glitch android, male druid, 8 a 3 d outfit, alien detailed, broken mask, shadows realism, beeple, wizard robot, inside karol very epcot, by albedo glowing colossus, forest kodak skeleton, boom engine fantasy being, blood octane glitchcore, beksinski, japan, cannon cinematic, hyper render, dan druid eye final mask, the providence, / hornwort, k, station, key insect, rutkowski eye from coherent 4 artstation, intricate giygas render, high bak, very oni spell, close,
+tennis ball monsters playing tennis, a tennis ball monster ,tennis ball, colorful, digital art, fantasy,epic, magic, trending on artstation, ultra detailed, professional illustration,chalk, poster artwork by Basil Gogos , clean
+Photorealistic Duncan Bentley from the band Vulvodynia. Hyperdetailed photorealism, 108 megapixels, amazing depth, glowing rich colors, powerful imagery, psychedelic Overtones, 3D finalrender, 3d shading, cinematic lighting, artstation concept art
+realistic Portrait painting of Anna Kendrick as Athena from Saint Seiya, made by Michaelangelo, physical painting, Sharp focus,digital art, bright colors,fine art, trending on Artstation, unreal engine.
+Lofi portrait by Tristan Eaton Stanley Artgerm and Tom Bagshaw
+amazing lifelike award winning pencil illustration of Adolf Hitler trending on art station artgerm Greg rutkowski alphonse mucha cinematic
+a stunning upper body portrait of a beautiful woman by marvel comics, digital art, trending on artstation
+Very very very very highly detailed epic central composition photo of Mr Bean face, intricate, happy stoner vibes, extremely detailed, digital painting, smooth, sharp focus, illustration, intimidating lighting, incredible art by Brooke Shaden, artstation, concept art, Octane render in Maya and Houdini
+two large pirate ships floating on top of a body of water at sunset, fighting each other, pirate flags, cgsociety, fantasy art, 2d game art, concept art, ambient occlusion, bokeh, behance hd, concept art by Jesper Ejsing, by RHADS, Makoto Shinkai Cyril Rolando
+lofi underwater steampunk bioshock instagram portrait, Pixar style, by Tristan Eaton Stanley Artgerm and Tom Bagshaw.
+album cover for iron maiden the trooper, wide angle, super highly detailed, professional digital painting, artstation, concept art, smooth, sharp focus, no blur, no dof, extreme illustration, unreal engine 5, photorealism, hd quality, 8 k resolution, cinema 4 d, 3 d, beautiful, cinematic, art by derek riggs
+an epic painting of the wizard in the hood, making hand passes to create new era, dark, mystic, oil on canvas, perfect composition, golden ratio, beautiful detailed, photorealistic, digital painting, concept art, smooth, sharp focus, illustration, artstation trending, octane render, unreal engine
+helmet lion cyberpunk made of yellow lava and fire art in borderlands 3 style, profile portrait, cyberpunk fashion, realistic shaded perfect face, fine details, very dark environment, misty atmosphere, closeup, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone
+godly tree of life closeup seen from outer space engulfs the earth closeup macro upscale, cinematic view, epic sky, detailed, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by jordan grimmer, huge scene, grass, art greg rutkowski
+hand drawn cute one gnomes face in autumn pumpkin, detailed closeup face, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by jordan grimmer, huge scene, grass, art greg rutkowski
+warmly lit close up studio portrait of young angry!! teenage Jimmy Carter angrily singing, impasto oil painting thick brushstrokes by Cy Twombly and Anselm Kiefer , trending on artstation dramatic lighting abstract Expressionism
+soft lustrous ivory biotech raver clowncore madison beer gothic cyborg, earbuds, golden ratio, details, sci - fi, fantasy, cyberpunk, intricate, decadent, highly detailed, digital painting, ever after high, octane render, artstation, concept art, smooth, sharp focus, illustration, art by artgerm, loish, wlop
+Ellie (Last of Us), full body, detailed, 8k, dark, trending on artstation, felix englund style, high resolution, Rutkowski , Sung Choi , Mitchell Mohrhauser, Maciej Kuciara, Johnson Ting, Maxim Verehin, Peter Konig, Bloodborne, 8k photorealistic, cinematic lighting, HD, high details, dramatic, atmospheric
+ene from mekakucity actors, wearing blue jacket, blue pigtails, cool color palette, digital art by aramaki shinji, by artgerm, by cushart krenz, by wlop, colorful, insanely detailed and intricate, hypermaximalist, elegant, ornate, dynamic pose, hyper realistic, super detailed
+skull helmet front and side view, concept art
+Portrait of Abbey Lee as a tall blonde blue-eyed elf woman with pale white hair, wearing stylish white and gold robes, warm and gentle smile, intricate, elegant, highly detailed, digital painting, smooth, sharp focus, bust view, visible face, artstation, graphic novel, art by stanley artgerm and greg rutkowski and peter mohrbacher,
+sensual good looking pale young indian doctors wearing jeans celebrating after an exam, portrait, elegant, intricate, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+dark red paper with intricate designs, tarot card, a mandelbulb fractal southeast asian buddha statue, full of golden layers, flowers, cloud, vines, mushrooms, swirls, curves, wave, by Hokusai and Mike Mignola, trending on artstation, elaborate dark red ink illustration
+very detailed portrait of a skater yogi american man in his mid twenties, boyish style, oval shaped face, designer stubble!!!!!!!!!!!!!!!!!!, ( ( deep hazel eyes ) ), strong round!!! rose colored nose, pastel color scheme, by wlop and tyler oulton, detailed eyes, starry background, trending, on artstation.
+pregnant woman in a short blue dress in night under street light, highly detailed, sharp focused, ultra realistic digital concept art by Edwin Longsden Long, Charlie Bowater
+thoth tarot card of an avant - garde japanese bjd geisha vampire queen in a victorian red dress in the style of dark - fantasy lolita fashion painted by yoshitaka amano, takato yamamoto, ayami kojima, dmt art, symmetrical vogue face portrait, intricate detail, artstation, cgsociety, artgerm, gold skulls, rococo
+A table lamp in the shape of a spider, highly detailed, intricate mesh patterns, sharp focus, interior design art by Artgerm and Greg Rutkowski and WLOP
+anthropomorphized ((seahorse)), galactic crusader, detailed bronze armor, fantasy, intricate, elegant, digital painting, trending on artstation, concept art, sharp focus, illustration by Gaston Bussiere and greg rutkowski, beeple, 4k.
+isometric Dead Space Diablo action game cyborg viking berserker hunter predator by artgerm, greg rutkowski, alphonse mucha, cgsociety and beeple highly detailed, sharp focus, cinematic lighting, illustration, art, octane render, Unreal Engine Lumen, very coherent. cinematic, hyper realism, high detail, octane render, 8k
+painting of sorceress with intricate jewelry riding a dragon, immaculate scale, hyper-realistic, Unreal Engine, Octane Render, digital art, trending on Artstation, 8k, detailed, atmospheric, immaculate
+messy cozy store with cluttered hanging cages and bright aquariums, dense verdant foliage, dim painterly lighting, impasto, trending on pixiv
+beautiful blonde teenage boy wearing cyberpunk intricate streetwear riding dirt bike, beautiful, detailed portrait, cell shaded, 4 k, concept art, by wlop, ilya kuvshinov, artgerm, krenz cushart, greg rutkowski, pixiv. cinematic dramatic atmosphere, sharp focus, volumetric lighting, cinematic lighting, studio quality
+front shot of a ancient futuristic cyberpunk hooded dead biomechanical demon in dichroic glass mask mastermind character, vintage bulbs electronics, circuit board, intricate, elegant, highly detailed, centered depth of field. mandala background, (((artstation, concept art, smooth, sharp focus, artgerm, Tomasz Alen Kopera, Peter Mohrbacher, donato giancola, Joseph Christian Leyendecker, WLOP, Boris Vallejo))), octane render, unreal engine, 3d render, macro mugshot!!!!!, ugly!!!!!!, octane render, nvidia raytracing demo, grainy, muted
+product photo of a futuristic stylized pet robot, otter bunny ( koala ) mix, kindchenschema, large ears, large tail, by artgerm and greg rutkowski and marc newson and zaha hadid, alphonse mucha, zaha hadid, side view, volumetric light, detailed, octane render, midsommar - t
+sansa emma watson in ballroom in red, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+ultra realistic facial close up portrait of lee sin from league of legends, by riot games, extremely detailed digital painting, in the style of fenghua zhong and ruan jia and jeremy lipking and peter mohrbacher, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, trending on artstation
+picture of one glorious traditional Atlantean wizard, smiling, traditional clothes, cinematic, high quality, cgsociety, artgerm, 4K, UHD, trending on ArtStation
+plastic miniature boardgame figurine of ricardo fort, blender, 8 k, octane render, unreal engine, redshift render, trending on artstation, highly detailed
+a landscape in hell, intricate, highly detailed, digital painting, official media, anime key visual, concept art, rich vivid colors, ambient lighting, sharp focus, illustration, art by wlop
+the golden wheel of fortune. surrounded by angels and devils. sky and clouds in the background. intricate, elegant, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, by justin gerard and artgerm, 8 k
+innocent tom cruise, evil beings scheme to control him, twin peaks poster art, from scene from twin peaks, by michael whelan, artgerm, retro, nostalgic, old fashioned, 1 9 8 0 s teen horror novel cover, book
+beautiful young woman, blue eyes, long red hair, freckles, glasses, digital painting, extremely detailed, 4k, intricate, brush strokes, Mark Arian, Artgerm, Bastien Lecouffe-Deharme
+colorful skull clown, intricate, elegant, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, vibrant colors, art by Greg rutkowski
+portrait painting of a cyberpunk corporate boss elven michael b. jordan, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+a gnome druid, Justin Gerard and Greg Rutkowski, realistic painting, Digital art, very detailed, High definition, trending on Artstation
+eden creature from paradise fallen on earth, divine, irresistible , light ** , fantasy, portrait, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, Alphonse mucha, and Greg Rutkowski
+nekopara fantastically detailed eyes modern anime style art cute vibrant detailed ears cat girl neko dress portrait shinkai makoto Studio ghibli Sakimichan Stanley Artgerm Lau Rossdraws James Jean Marc Simonetti elegant highly detailed digital painting artstation pixiv
+photo of a cyborg girl on a space ship, warframe armor, scifi, professionally color graded, interesting angle, sharp focus, 8 k high definition, insanely detailed, intricate, innocent, art by stanley lau and artgerm
+great old one, dramatic light, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+aristocrat, ultra detailed fantasy, elden ring, realistic, dnd character portrait, full body, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, global illumination radiating a glowing aura global illumination ray tracing hdr render in unreal engine 5
+people in a busy city people looking at a white building covered with graffiti paint dripping down to the floor, professional illustration by james jean, painterly, yoshitaka amano, hiroshi yoshida, moebius, loish, painterly, and artgerm, illustration
+Ocean, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by Jordan grimmer, huge scene, grass, art greg rutkowski
+I woke up in a world that had fragments of you. intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Greg Rutkowski and Alphonse Mucha, masterpiece
+photo of shibe playing video - game, realism, realistic, photorealism, f 3. 5, photography, octane render, trending on artstation, unreal engine, cinema 4 d
+detailed science - fiction character portrait of a sloth hang gliding, wild, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+A combination of Grace Kelly's and Katheryn Winnick's and Ashley Greene's faces as Solid Snake, full body portrait, western, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, half body portrait, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+ultra realistic illustration, a hulking herculean alexander skarsgard with leather armour, from doom and warhammer, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+little girl in pajamas sleeping, realistic portrait, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+portrait of Emma Watson as Hermione Granger sitting next to a window reading a book, wearing Hogwarts school robes, focused expression, golden hour, art by Kenne Gregoire, trending on artstation
+little wonder miss hero Video game icon fantasy art heartstone , 2d game art, official art, concept art , behance hd , concept art by Jesper Ejsing, by RHADS, Makoto Shinkai bastion magic potion forged armor sword helmet loot stuff
+steampunk robot fly, 3 d model, unreal engine realistic render, 8 k, micro detail, intricate, elegant, highly detailed, centered, digital painting, artstation, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+amazing lifelike award winning clockwork phantom trending on art station artgerm greg rutkowski alphonse mucha cinematic
+character concept art portrait of a robotic suit, depth of field background, artstation, award - winning realistic sci - fi concept art by jim burns and greg rutkowski, beksinski, a concept art masterpiece, monotone color palette, james gilleard, bruegel, alphonse mucha, and yoshitaka amano.
+ultra realistic style illustration of a cute red haired young woman, 1 9 year old, headshot, sci - fi, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, 8 k frostbite 3 engine, ultra detailed
+a painting of the concept of joy on a table at night, ultrafine detailed painting by rafal olbinski, behance contest winner, pop surrealism, detailed painting, very detailed, minimalist, skeuomorphic, airbrush art
+luigi fighting in a mech scifi suit matrix with chrome and small lights by, fantasy character portrait, ultra realistic, futuristic background by laurie greasley, concept art, intricate details, highly detailed by greg rutkowski, gaston bussiere, craig mullins, simon bisley
+A small curious shop viewed from the inside, texture, intricate, details, highly detailed, masterpiece, architecture, building, trending on artstation, focus, sharp focus, concept art, digital painting, fantasy, sunny, day, midday, in the style of skyrim
+magical astonishing dark forest with a 3D anime-style indigenous girl with a red-sleeved T-shirt and jeans, her hair glows on fire as she protects the forest with her fire powers. trending on artstation, splash art hyper-detailed, 4K
+a beautiful mysterious woman holding a large bouquet of flowing flowers, sleeping in an elaborate coffin, fantasy, regal, intricate, by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman rockwell
+a bard playing his lute in a pub, d & d, orange hair, portrait, sharp focus, fantasy, digital art, concept art, dynamic lighting, epic composition, by emylie boivin, rossdraws
+closeup portrait shot of domhnall gleeson as puck, robin goodfellow, pooka, fairy wings, highly detailed, digital painting, artstation, concept art, soft focus, depth of field, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, wlop, boris vallejo
+fox as a monkey, fluffy white fur, black ears, stunning green eyes, extremely long white tail with black tip, full body, award winning creature portrait photography, extremely detailed, artstation, 8 k, sensual lighting, incredible art, wlop, artgerm
+a dynamic painting of a gigantic obese white dragon, a fat tank monster, baroque, concept art, deep focus, fantasy, intricate, highly detailed, digital painting, artstation, matte, sharp focus, illustration, art by greg rutkowski and alphonse mucha
+in the style of artgerm, arthur rackham, alphonse mucha, evan rachel wood, symmetrical eyes, symmetrical face, flowing white dress, warm colors
+queen in a glass cage, fame of thrones, lord of daggers, neon, fibonacci, sweat drops, insane, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+werewolf in the city lviv church of st. elizabeth, portrait, highly detailed, full body, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+a stunning portrait of a young human wizard, forming a burning hand spell, digital art 4 k trending on artstation
+a professional photographic view picture of a dark city ,photographic filter unreal engine 5 realistic hyperdetailed 8k ultradetail cinematic concept art volumetric lighting, fantasy artwork, very beautiful scenery, very realistic painting effect, hd, hdr, cinematic 4k wallpaper, 8k, ultra detailed, high resolution, artstation trending on artstation in the style of Albert Dros glowing rich colors powerful imagery
+A full body shot of a cute young magical girl wearing an ornate dress made of opals and tentacles. Chibi Monster GIrl. Subsurface Scattering. Dynamic Pose. Translucent Skin. Rainbow palette. defined facial features, symmetrical facial features. Opalescent surface. Soft Lighting. beautiful lighting. By Giger and Ruan Jia and Artgerm and WLOP and William-Adolphe Bouguereau. Photo real. Hyper-real. Fantasy Illustration. Sailor Moon hair. Masterpiece. trending on artstation, featured on pixiv, award winning, cinematic composition, dramatic pose, sharp, details, Hyper-detailed, HD, HDR, 4K, 8K.
+hector. a cyberpunk assassin fighting cops, centered in the frame, cyberpunk concept art by Jean Giraud and josan gonzales, digital art, highly detailed, intricate, sci-fi, sharp focus, Trending on Artstation HQ, deviantart, 4K UHD image
+sci - fi wall structure and futuristic car on the coronation of napoleon painting and digital billboard with point cloud in the middle, unreal engine 5, keyshot, octane, artstation trending, ultra high detail, ultra realistic, cinematic, 8 k, 1 6 k, in style of zaha hadid, in style of nanospace michael menzelincev, in style of lee souder, blade runner 2 0 4 9 colors, in plastic, dark, tilt shift, depth of field,
+Small hipster coffee shop, cozy wallpaper, 4k, trending on Artstation, pixel art, award-winning, art by Greg Rutkowski
+a highly detailed illustration of short ginger haired man wearing white suit, dramatic holding spellbook pose, succubus girl floating behind him, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, league of legends concept art, WLOP
+Cybernetic assassin concept design, with dynamic pose, fantasy, dark, majestic, elegant, iridescent, dark, greg rutkowski, artgerm, artstation, digital illustration
+dark elf concept, wearing ancient dark armor, beksinski, trending on artstation
+beautiful female ginger hair glasses symmetrical face eyes full length fantasy art, fae princess, forest landscape reading a book, fantasy magic, dark light night, sharp focus, digital painting, 4k, concept art, d&d, art by WLOP and Artgerm and Greg Rutkowski and Alphonse Mucha
+anthropomorphic d 2 0 goblin head in opal darkiron santa claus caricature eating d 2 0, intricate, elegant, highly detailed orang - utan, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm, bob eggleton, michael whelan, stephen hickman, richard corben, wayne barlowe, greg rutkowski, alphonse mucha, 8 k
+Predator (1987) as an Assassin from Assassin's Creed, wearing a hood, portrait, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+'' Illustration Spiderman (Fenrir) breaking its chains, (night), (moon in the background), league of legends, Fenrir, LOL, fantasy, d&d, digital painting, artstation, concept art, sharp focus, illustration, art by greg rutkowski and alphonse mucha ''
+drow hunter, fantasy, amber eyes, face, long hair, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Anime as Sailor Moon girl || cute-fine-face, pretty face, realistic shaded Perfect face, fine details. Anime. realistic shaded lighting poster by Ilya Kuvshinov katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash and Rob Rey Sailor-Moon Sailor Moon
+Portrait of a stylish female space pirate, dark-hair, golden eyes, androgynous tailored clothes, delicate features, teasing smile, face visible, artstation, graphic novel, art by stanley artgerm and greg rutkowski and peter mohrbacher,
+concept art by jama jurabaev, cel shaded, cinematic shot, trending on artstation, high quality, brush stroke, hyperspace, vibrant colors, spaceship going hyperdrive interstellar
+concept art by david cronenberg diver astronaut in underwater futuristic dark and empty spaceship. complex and hyperdetailed technical suit design. reflection material. rays and dispersion of light breaking through the deep water. 3 5 mm, f / 3 2. noise film photo. flash photography. trend artstation
+full length photo of a gorgeous young woman in the style of stefan kostic, realistic, sharp focus, 8k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm
+a highly detailed illustration of short hair cute japanese girl wearing blood stained hoodie and bandages on arms, dramatic sadistic smile pose, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, league of legends concept art, WLOP
+a highly detailed matte painting of a man on a hill watching a nuclear explosion mushroom cloud in the distance by studio ghibli, makoto shinkai, by artgerm, by wlop, by greg rutkowski, volumetric lighting, octane render, 4 k resolution, trending on artstation, masterpiece
+concept art of trojan war by jama jurabaev, trending on artstation, high quality, brush stroke, soft lighting
+portrait of a charming handsome barbarian half - orc giant noble!, imperial royal elegant clothing, elegant, rule of thirds, extremely detailed, artstation, concept art, matte, sharp focus, art by greg rutkowski, cover by artgerm
+photorealistic portrait depiction of a beautiful alien femme biology, latex domme, extraterrestrial, sharp focus, by james gurney, by corbusier, by greg rutkowski, ornate painting, high quality
+portrait futuristic kawaii cyberpunk female police, in heavy raining futuristic tokyo rooftop cyberpunk night, sci-fi, fantasy, intricate, very very beautiful, elegant, neon light, highly detailed, digital painting, artstation, concept art, soft light, hdri, smooth, sharp focus, illustration, art by tian zi and craig mullins and WLOP and alphonse mucha
+highly detailed portrait kanye west in gta v stephen bliss unreal engine fantasy art by greg rutkowski loish rhads ferdinand knab makoto shinkai lois van baarle ilya kuvshinov rossdraws tom bagshaw global illumination radiant light detailed intricate environment
+A full portrait of a beautiful post apocalyptic offworld dust merchant, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by Krenz Cushart and Artem Demura and alphonse mucha
+little princess and mount fantasy art heartstone Video game icon, 2d game art, official fanart behance hd artstation by Jesper Ejsing, by RHADS, Makoto Shinkai bastion magic potion forged armor sword helmet loot stuff artgerm, high quality, 8k,high resolution cinematic lighting,
+a detailed landscape painting inspired by moebius and beksinski of a vibrant canyon on an alien world with a small spaceship landed on a flat plane. inspired by dieselpunk. science fiction poster. cinematic sci - fi scene. science fiction theme with lightning, aurora lighting. clouds and stars. smoke. futurism. fantasy. by beksinski carl spitzweg. baroque elements. baroque element. intricate artwork by caravaggio. oil painting. oil on canvas. award winning. dramatic. trending on artstation. 8 k
+samus aran, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+body portrait of beautiful egyptian princess wearing a flowing silk robe, wearing an ornate ancient headdress, full body portrait of a young beautiful woman high angle by terry o'neill intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, bold lighting, deep colors, dark background, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+hyperrealistic mixed media high resolution image of a beautiful dragon, stunning 3d render inspired art by István Sándorfi and Greg Rutkowski and Unreal Engine, perfect symmetry, dim volumetric lighting, 8k octane beautifully detailed render, post-processing, extremely hyper-detailed, intricate, epic composition, highly detailed attributes, highly detailed atmosphere, full body shot, cinematic lighting, masterpiece, trending on artstation, very very detailed, masterpiece, stunning, flawless structure, lifelike texture, perfection,
+a horse the size of a duck, stood next to a duck the size of a horse, evening light, cinematic photography, digital painting, volumetric light, concept art, trending on artstation, digital Art, fantasy art
+concept art of a lush indoor hydroponics lab in a far - future utopian city, apples oranges pears fruit, key visual, ambient lighting, highly detailed, digital painting, artstation, concept art, sharp focus, by makoto shinkai and akihiko yoshida and hidari and wlop
+Close-up portrait of kind young woman with black hair in a pony tail, with a backpack, slightly dirty face, transparent background, png, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+beautiful, young woman, detailed gorgeous face, vaporwave aesthetic, synthwave, colorful, psychedelic, artstation, concept art, smooth, extremely sharp detail, thorn crown, flowers, bees, finely tuned detail, ultra high definition, 8 k, unreal engine 5, ultra sharp focus, illustration, art by artgerm, greg rutkowski and alphonse mucha
+wolverine as captain america, intricate, fantasy concept art, elegant, by Stanley Artgerm Lau, golden ratio, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell,
+a masterpiece digital painting of a white bear in medieval armor, roaring, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration in the style of wlop, greg rutkowski, artgerm and magali villeneuve
+Boris Johnson as Deadpool, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+classical oil painting of anime key visual environment concept art of among us crewmate anime adaptation, trending on artstation, brush strokes, oil, canvas, style of kawacy makoto shinkai jamie wyeth james gilleard edward hopper greg rutkowski, preserved historical
+the city of light : the city is a beacon of hope in the dark world. it's a place of warmth and safety, where people can come to start anew. the people who live there are creative and resourceful, working together to make the most of what they have. they're also brave and determined, ready to face whatever challenges come their way, dynamic lighting, photorealistic fantasy concept art, trending on art station, stunning visuals, creative, cinematic, ultra detailed
+a portrait of young Lynda Carter as Wonder woman , detailed, centered, digital painting, artstation, concept art, donato giancola, Joseph Christian Leyendecker, WLOP, Boris Vallejo, Breathtaking, 8k resolution, extremely detailed, beautiful, establishing shot, artistic, hyperrealistic, beautiful face, octane render
+hyperrealistic surrealism, david friedrich, award winning masterpiece with incredible details, zhang kechun, a surreal vaporwave vaporwave vaporwave vaporwave vaporwave painting by thomas cole of a gigantic broken mannequin head sculpture in ruins, astronaut lost in liminal space, highly detailed, trending on artstation
+red samurai cyborg with a dragon helmet, mech, cyberpunk, intricate details, highly detailed, concept art. Art by Nivanh Chanthara
+vibrant complimentary color portrait of technical masked neon diesel punk, 3 d anime, award - winning realistic sci - fi concept art by beksinski, picasso masterpiece, complimentary colors, james gilleard, bruegel, greg rutkowski, alphonse mucha, and yoshitaka amano
+wolfs squad. pop art, paper please style, bioshock style, gta chinatown style, proportional, dynamic composition, face features, body features, ultra realistic art, digital painting, concept art, smooth, sharp focus, intricate, without duplication, elegant, confident posse, art by artgerm and richard hamilton and mimmo rottela, kirokaze and paul robertson
+symmetrical portrait bust of young woman with shoulder length light brown hair and hazel eyes dressed in a sharp dark teal military uniform and beret, blurred city background in twilight lighting, ilya kuvshinov, anime, greg rutkowski, guweiz, ross tran, artstation trending, artgerm, concept art, digital painting, painterly
+a cyberpunk portrait of chewbacca by jean - michel basquiat, by hayao miyazaki by artgerm, highly detailed, sacred geometry, mathematics, snake, geometry, cyberpunk, vibrant, water
+a closeup painting of a handsome cowboy saying yes and making a pleased face | by alphonse mucha | volumetric lighting, golden hour, realistic lighting, 4 k, 8 k | trending on artstation
+cathedral of salt, extremely detailed digital painting, vibrant colors, in the style of tomasz alen kopera and fenghua zhong and peter mohrbacher, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, trending on artstation
+Scarlet Witch, highly detailed, digital painting, artstation, standing, facing camera, concept art, smooth, sharp focus, illustration, art by artgerm and alphonse mucha, high definition digital art, dramatic lighting, in the style of ilya kuvshinov and Ross tran
+thanos building a tension belt for a van alternator from a blueprint, 4 k, lomography, gellyroll gelpens, concept art, moebius, bryce 3. 3 3 4 th 3 d
+a _ fantasy _ style _ portrait _ painting _ of middle eastern male brown wavy hair glasses beard, rpg dnd oil _ painting _ unreal _ 5 _ daz. _ rpg _ portrait _ extremely _ detailed _ artgerm _ greg _ rutkowski _ greg
+anthropomorphic highly detailed group portrait of funny mr bean neon giant cute eyes hermit, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm, bob eggleton, michael whelan, stephen hickman, richard corben, wayne barlowe, trending on artstation and greg rutkowski and alphonse mucha, 8 k
+UHD photorealistic studio portrait of a cyborg Angel with hyperrealistic Angel wings, futuristic robot angel, exotic alien features, robotic enhancements, Tim Hildebrandt, Wayne Barlowe, Bruce Pennington, donato giancola, larry elmore, , masterpiece, trending on artstation, , cinematic composition, dramatic pose, studio lighting, sharp, crisp detail, hyperdetailed
+a grungy woman with rainbow hair, soft eyes and narrow chin, dainty figure, long hair straight down, torn overalls, short shorts, combat boots, side boob, wet tshirt, raining, basic white background, symmetrical, watercolor, pen and ink, intricate line drawings, by Yoshitaka Amano, Ruan Jia, Kentaro Miura, Artgerm, detailed, trending on artstation, hd, masterpiece,
+mahindra thar driving through madagascar with baobabs trees, tribe members chasing for an attach, action scene, an epic fantasy, artgerm and greg rutkowski and alphonse mucha, an epic fantasy, volumetric light, detailed, establishing shot, an epic fantasy, trending on art station, octane render, midsommar
+a professional photographic portrait view picture of a minimalist luxurious room, photographic filter unreal engine 5 realistic hyperdetailed 8 k ultradetail cinematic concept art volumetric lighting, fantasy artwork, very beautiful scenery, very realistic painting effect, hd, hdr, cinematic 4 k wallpaper, 8 k, ultra detailed, high resolution, artstation trending on artstation in the style of albert dros glowing rich colors powerful imagery
+a fancy portrait of a very attractive succubus by greg rutkowski, beautiful dress, beeple, sung choi, mitchell mohrhauser, maciej kuciara, johnson ting, maxim verehin, peter konig, final fantasy, macro lens, 8 k photorealistic, cinematic lighting, hd, high details, dramatic, dark atmosphere, trending on artstation
+a colorful comic noir illustration painting of a cyberpunk girl by sachin teng and sam yang!! and artgerm!! and lois van baarle and ross tran!!. in style of digital art, symmetry, sci fi, hyper detailed. octane render. trending on artstation
+chrysta bell, pinup, league of legends, intricate, highly detailed, digital painting, hyperrealistic, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha, by Jesper Ejsing
+a wacky clown is participating in the running of the bulls in pamplona, by stanley artgerm and greg rutkowski, dramatic lighting, highly detailed, incredible quality, trending on artstation, national geographic photo winner
+terrifying otherworldly dimension of the crystalline entities, concept art by filip hodas, john howe, mike winkelmann, jessica rossier, andreas rocha, bruce pennington, 4 k,
+very high quality illustration of green hills with clouds in the background, golden hour sunset, purple beautiful sky, anime key visual, official media, illustrated by wlop, extremely detailed, 8 k, trending on pixiv, cinematic lighting, beautiful
+The fluffiest little fuzzbutts in the world, huggy wuggy from poppy playtime video game, fullbody, ultra high detailed, glowing lights, oil painting, Greg Rutkowski, Charlie Bowater, Beeple, unreal 5, DAZ, hyperrealistic, octane render, RPG portrait, dynamic lighting, fantasy art, beautiful face
+anthropomorphic fluffy fox look like Indiana jones on the hot air balloon at night, clouds around, entire person visible, DnD character, unreal engine, octane render, dramatic lighting, pond, digital art, by Stanley Artgerm Lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell,
+a young man wearing raybands holding a beer giving a thumbs up with a long beard, real life skin, intricate, elegant, highly detailed, artstation, concept art, smooth, sharp focus, airbrush painted, art by artgerm and greg rutkowski and alphonse mucha
+Madonna, the singer, as Medusa snakehair closeup, D&D, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone, art by Artgerm and Greg Rutkowski and Alphonse Mucha tarotcard
+a whirlwind inside the metaverse, guy, male, man, science, machine face, fashionable haircut, half body, neurochip, android, cyberpunk face, by loish, d & d, fantasy, intricate, elegant, highly detailed, colorful, digital painting, artstation, concept art, art by artgerm and greg rutkowski and alphonse mucha
+side profile centered painted portrait, rollerskating monkey, Gloomhaven, matte painting concept art, art nouveau, beautifully backlit, swirly vibrant color lines, fantastically gaudy, aesthetic octane render, 8K HD Resolution
+capybara holding a blaster, very very anime!!!, fine - face, realistic shaded perfect face, fine details. anime. realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo ghost - in - the - shell, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+fork fork fork, symmetry, faded colors, exotic alien features, forestpunk background, tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, masterpiece, trending on artstation, featured on pixiv, cinematic composition, beautiful lighting, sharp, details, hyper detailed, 8 k, unreal engine 5
+landscape with waterfalls and stunning light and cheerful colors, epic composition, cinematic lighting, masterpiece, trending on artstation, very very detailed, masterpiece, stunning
+portrait of ronaldo nazario, wearing green soccer clothes, very detailed eyes, hyperrealistic, very detailed painting by glenn fabry, by joao ruas, by artgerm
+A lazy steampunk cat jumping over the galaxy, digital illustration, concept art, 8k, trending on artstation
+a fantastical translucent!!! small horse made of water and foam, ethereal, noble, radiant, hyperalism, scottish folklore, digital painting, artstation, concept art, smooth, 8 k frostbite 3 engine, ultra detailed, art by artgerm and greg rutkowski and magali villeneuve
+ancient queen emma watson, symetrical, by junji ito, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, artstation, illustration, concept art, 4 k, smooth, sharp focus, art by john collier and albert aublet and krenz cushart and artem demura and alphonse mucha
+aesthetic portrait commission of a of a male fully furry muscular anthro albino lion wearing attractive gay leather harness with a tail and a beautiful attractive hyperdetailed face at golden hour, safe for work (SFW). Character design by charlie bowater, ross tran, artgerm, and makoto shinkai, detailed, inked, western comic book art, 2021 award winning film poster painting
+ultra realistic illustration, man in a jacket with two dark glasses, with black hair, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+portrait of one meadow metal horse by gaston bussiere, anna nikonova aka newmilky, greg rutkowski, yoji shinkawa, yoshitaka amano, tsutomu niehi, moebius, donato giancola, geoffroy thoorens, concept art, trending on artstation, featured on pixiv, cinematic composition, 8 k
+parrot as a bartender, dimly-lit cozy tavern, fireplace, 8k octane beautifully detailed render, post-processing, extremely hyperdetailed, intricate, epic composition, grim yet sparkling atmosphere, cinematic lighting + masterpiece, trending on artstation, very detailed, vibrant colors
+a roman palace reaching to the sky, glorious, epic scene, beautiful, pools, vegetation, in the style of artgerm, gerald brom, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation
+glamorous scorpion portrait, bra, seductive eyes and face, elegant, lascivious pose, very detailed face, studio lighting, photorealism, portrait by Magali Villeneuve and Steve Argyle,Livia Prima,Mucha,dress,fantasy art,beautiful,artstation,trending on artstation,intricate details,alluring,masterpiece
+face of a cute alien girl wearing shiny plastic armor in the style of roger dean and alberto vargas and stefan kostic, realistic, sharp focus, 8 k high definition, insanely detailed, intricate, elegant, art by greg rutkowski and artgerm, extreme blur coral reef background
+a color pencil sketch of a mysterious plague doctor with a white mask wearing a blue wisards robe, concept art, by greg rutkowski and makato shinkai, by melmoth zdzislaw belsinki craig mullins yoji shinkawa, black light, semi - realistic render, pencil, paint smears, realistic manga, dramatic lighting, d & d design
+a beautiful barmaid, dimly lit cozy tavern in the style of Francis Bacon and Syd Mead and Edward Hopper and Norman Rockwell and Beksinski, open ceiling, highly detailed, painted by Francis Bacon, painted by James Gilleard, surrealism, airbrush, Ilya Kuvshinov, WLOP, Stanley Artgerm, very coherent, art by Takato Yamamoto and James Jean
+isolated magnolia flowers with no people, colorful, psychedelic, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+jossi of blackpink, king, tarot card, highly detailed, digital painting, smooth, sharp focus, illustration, ultra realistic, 8 k, art by artgerm and alphonse mucha
+the most beautiful sunset, giant pink full moon, coherent design, symmetrical, concept art, vivid color, complementary color, golden ratio, detailed, sharp lines, intricate, rainbowshift, by maxfield parrish, by peter mohrbacher, by gustave dore, by arthur rackham, octane render
+donald trump, ornate, beautiful, atmosphere, vibe, mist, smoke, chimney, rain, well, wet, pristine, puddles, waterfall, melting, dripping, snow, ducks, creek, lush, ice, bridge, cart, forest, flowers, concept art illustration, color page, 4 k, tone mapping, akihiko yoshida, james jean, andrei riabovitchev, marc simonetti, yoshitaka amano, digital illustration, greg rutowski, volumetric lighting, sunbeams, particles, trending on artstation
+fantasy art, animal conceptual artwork, woman with giant fish, surreal painting, illustration dream and imagination concept, mystery of nature
+a cute giantess wearing school uniform standing in the city which seem small, bird's eye view, gouache, 8 k wallpaper, strong brush stroke, very high detailed, sharp focus, illustration, morandi color scheme, art station, by krenz cushart
+inside a cozy post apocalyptic library, concept art, trending on artstation
+baroque acrylic painting of key visual concept art, anime maids in crusade battlefield with early tanks, brutalist fantasy, rule of thirds golden ratio, fake detail, trending pixiv fanbox, palette knife, style of makoto shinkai ghibli takashi takeuchi yoshiyuki sadamoto jamie wyeth james gilleard greg rutkowski chiho aoshima
+baroque oil painting, anime key visual full body portrait character concept art, maid nazi ss commander, brutalist grimdark fantasy, kuudere blond hair blue eyes, fascist nationalist, trending pixiv fanbox, rule of thirds golden ratio, makoto shinkai genshin impact studio ghibli jamie wyeth greg rutkowski chiho aoshima
+kanye west. in style of yoji shinkawa and hyung - tae kim, trending on artstation, dark fantasy, great composition, concept art, highly detailed, dynamic pose, vibrant colours.
+a Japanese modern style luxurious living room, high definition, 8k, intricate and epic concept art, highly detailed, cinematic,
+anonymous as elmo, award winning creature photography, extremely detailed, artstation, 8 k, sensual lighting, incredible art, wlop, artgerm
+portrait painting of man biting woman neck, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+a male half elf in fireproof leather armor wearing a utility belt and goggles, D&D, fantasy, intricate, cinematic lighting, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by Terry Moore and Greg Rutkowski and Alphonse Mucha
+portrait painting of a black muscular bloodied indian middle aged woman in river screaming name of god, sari, ultra realistic, concept art, intricate details, eerie, horror, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+baroque oil painting full body portrait character concept art, anime key visual of smug young female maid nazi dictator, long straight blonde hair blue eyes, studio lighting delicate features finely detailed perfect face directed gaze, black nazi military uniform, gapmoe kuudere grimdark, trending on pixiv fanbox, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli
+symmetry!! 1 3 mm film portrait of bearded man, sci - fi -, cyberpunk, blade runner, glowing lights, tech, biotech, techwear!! intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, grain, old photograph
+matte painting of a huge swamp, overgrown with lush vines, immaculate scale, greg rutkowski, digital art, trending on artstation, detailed matte painting
+a stunning matte portrait of a thicc and voluptuous vampire dressed as a beautiful poison ivy with hair tied in a braid walking through a flowering garden, greenhouse in the background, dark eyeliner, intricate, elegant, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgem and jugendstil and greg rutkowski and alphonse mucha, pixv
+portrait of ( ( ( vladimir putin ) ) ) inapocalyptic russia with icecream, hyperrealistic, digital concept art, sharp focus, 3 5 mm film, caricature illustration, art by magic realism, art by josephine wall, art by huang guangjian, art by viktoria gavrilenko, art by amanda sage, trending on artstation
+pointillism painting of a white and caramel beagle dog playing with dragonfly, bright, god rays, dreamy, trending on artstation
+classical oil painting of anime key visual environment concept art of the founding of a nation, trending on artstation, brush strokes, oil, canvas, style of kawacy makoto shinkai jamie wyeth james gilleard edward hopper greg rutkowski, preserved historical
+evil magic steampunk sword concept art, trending on artstation 4k
+hockey game city location with hockey arena, medical building and office buildings. game illustration, gamedev, game, design, mobile game, aerial view, isometric, blizzard, easports, playrix, nexters, intricate, elegant, pixel perfect, sport game, highly detailed, amazing detail, digital painting, trending on artstation, sharp focus, by irina knk, by ann bruhanova, by zze festa, by tatiana gromova, 4 k
+a photorealistic dramatic fantasy render of a beautiful woman alexandra daddario wearing a beautiful intricately detailed japanese monkey kitsune mask and clasical japanese kimono by wlop, artgerm, greg rutkowski, alphonse mucha, epic, beautiful dynamic dramatic dark moody lighting, shadows, cinematic atmosphere, artstation, concept design art, octane render, 8 k
+indistinct man with his hand thrust forward, visible threads of magic link his hand to other people's bodies, he's puppeting them, fantasy, digital art, trending on artstation
+robot pregnant with a human, cozy atmospheric and cinematic lighting, ultra rendered extreme realism and detail 8 k, highly detailed, realistic, refined, bautiful, fine art photography, hyper realistic, in the style of greg rutkowski, by artgerm, by gustave dore, by marco turini, photorealistic, elegant, sharp focus, majestic, award winning picture, intricate, artstation,
+beautiful underwater futuristic city, trending on artstation
+photo of a gorgeous blonde female in cyberpunk city, realistic, sharp focus, 8 k high definition, insanely detailed, intricate, elegant, artgerm, greg kutkowski, high contrast dramatic lighting
+yoda ( 2 0 2 1 ) walking next to groot ( 2 0 1 7 ). they are friends. photorealistic, digital art, epic fantasy, dramatic lighting, cinematic, extremely high detail, cinematic lighting, trending, artstation, cgsociety, 3 d ue 5, 4 k, hq
+portrait of a ruggedly handsome ranger, hands details, muscular, half body, leather, hairy, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+warhammer 40k, full-lenght portrait of Emperor of Mankind, handsome man in massive gold armor without helmet, beautiful face, long blonde hair, digital art, illustration, fine details, cinematic, highly detailed, octane render, concept art
+illustration of an anime girl being mind controlled, by artgerm and wlop and greg rutkowski, digital art, extreme detail, realistic lighting, cinematic composition, concept art, sharp focus, colorful, photorealistic, 8 k
+mark zuckerberg as an alien, fantasy art, in the style of artgerm, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, vibrant, artgerm, award winning art
+a cloaked cyclops wielding a massive sword, smooth, intricate, elegant, digital painting, artstation, concept art, sharp focus, octane render, illustration, art by hirohiko araki, overwatch character,
+hyperrealistic photography of a highly detailed and symmetrical gorgeous nordic female scientist constructing a birth machine in the style of Jin Kagetsu, James Jean and wlop, highly detailed, masterpiece, award-winning, sharp focus, intricate concept art, ambient lighting, 8k, artstation
+a spaceship flying through space with galaxies in the back, epic lighting, in the art style of arcane, digital art, vector art, trending on artstation, highly detailed
+demonic evil cute fourteen year old south asian girl, tomboy, evil smile, freckles!!!, fully clothed, hypnotic eyes, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha, konstantin razumov, by william - adolphe bouguerea
+ultra realistic illustration, eva green as persephone, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+a highly detailed epic cinematic concept art CG render digital painting artwork: old dead couple at a decayed gas station surrounded by dark figures. By Greg Rutkowski, in the style of Francis Bacon and Syd Mead and Norman Rockwell and Beksinski, open ceiling, highly detailed, painted by Francis Bacon and Edward Hopper, painted by James Gilleard, surrealism, airbrush, Ilya Kuvshinov, WLOP, Stanley Artgerm, very coherent, triadic color scheme, art by Takato Yamamoto and James Jean
+a closeup photorealistic photograph of a cute smiling knitted bernedoodle judge dog dressed in a black gown, presiding over the courthouse. indoors, professional capture, well lit shot. this 4 k hd image is trending on artstation, featured on behance, well - rendered, extra crisp, features intricate detail, epic composition and the style of unreal engine.
+a hyper realistic character concept art of a ((cyberpunk real estate agent)) standing by a (For Sale) sign, half body, front facing camera, 4k rendered in Octane, trending in artstation, cgsociety, 4k post-processing highly detailed by wlop, Junji Murakami, Mucha Klimt, Sharandula, Hiroshi Yoshida, Artgerm, Craig Mullins,dramatic, moody cinematic lighting
+AN 8K RESOLUTION, MATTE PAINTING OF THE WISE AND ANcIENT alien TURTLE, swimming THROUGH a rainbow nebula BY BOB EGGLETON AND MICHAEL WHELAN. TRENDING ON aRTSTATION, hd, highly detailed, vibrant colors, astrophotography, volumetric lighting, dynamic portrait, wide lens, mass effect fan art
+cruising ship sailing at raining night at flooded miniature city, sun is on the rise on the town, cute style garden, octane render, trees, evergreen, patio, garden, wet atmosphere, tender, soft light misty yoshitaka amano, and artgerm
+concept art for a futuristic luxury business class suite in a widebody jet, two aisles, earth tones, digital painting, artstation
+portrait of betty cooper with fluffy bangs, bangs, 1 9 6 0 s, ponytail, curly bangs and ponytail, rounder face, intricate, elegant, glowing lights, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, mars ravelo and greg rutkowski
+spiky brown very short hair and glasses mage wearing robe, dndbeyond, bright, colourful, realistic, dnd character portrait, full body, pathfinder, pinterest, art by ralph horsley, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, hdr render in unreal engine 5
+Avenida Paulista painted by Greg Rutkowski
+master chief from halo fighting aliens, cinematic composition, epic cinematic lighting, realistic, unreal, highly detailed, 8 k, trending artstation, concept art, sharp focus
+close-up macro portrait of the dark queen, epic angle, epic pose, symmetrical artwork, photorealistic, iridescent, 3d with depth of field, blurred background. cybernetic phoenix bird, translucent dragon, nautilus. energy flows of water and fire, by Tooth Wu and wlop and beeple. a highly detailed epic cinematic concept art CG render digital painting artwork scene. By Greg Rutkowski, Ilya Kuvshinov, WLOP, Stanley Artgerm Lau, Ruan Jia and Fenghua Zhong, trending on ArtStation, made in Maya, Blender and Photoshop, octane render, excellent composition, cinematic dystopian brutalist atmosphere, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse
+sensual beautiful delhi girls wearing western little black dresses at a nightclub, epic scene, by victo ngai, kilian eng vibrant colours, dynamic lighting, digital art, winning award masterpiece, fantastically beautiful, illustration, aesthetically inspired by beksinski and dan mumford, trending on artstation, art by greg rutkowski, 8 k
+amazingly detailed semirealism, anthropomorphic pink rabbit character wearing a bucket hat. Cute, kawaii, Cooky, bt21, Sanrio inspired. Beautiful artwork, Rabbt_character, rabbit_bunny, 獣, iconic character splash art, Detailed fur, detailed textures, 4K high resolution quality artstyle professional artists WLOP, Aztodio, Taejune Kim, Guweiz, Pixiv, Instagram, dribbble, ArtstationHD
+pennywise giving micheal jackson a red balloon in the movie it, by stephen king, highly detailed, 8 k, artstation, cinematic, concept art, smooth, sharp focus, movie scene
+ultra realistic illustration, a full body portrait of deanna troi as death of the endless, the sandman, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+downtown toronto glowing eyes, shamanic poster lsd art, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, frank frazetta
+a frogish kaiju on a desolace planet, legendary epic shot, blade runner, by artgerm, julie bell, beeple and Greg Rutkowski, airbrush, concept art, matte painting, 80s, Smooth gradients, octane render, 8k, High contrast, duo tone, depth of field, volumetric lightning, very coherent artwork
+Dramatic portraiture of Uuen, the Pictish god of stags, mixed media, trending on ArtStation, by and ArtGerm and Lucian Freud, luminism
+incredible beautiful detailed intricate photorealistic painting of a group of friends laughing together. the colors are very vibrant and the people in the photo look very happy. award winning. vibrant colors, funny, personal, positive, visually pleasing, engaging. high resolution. high quality. photorealistic. hq hd. 8 k. trending on artstation. group of friends laughing. award winning
+concept art by greg rutkowski, a very tall, and slender man with short black hair, sitting with the crew in the ship's flight deck, brutalist futuristic interior, dark lighting atmosphere, detailed portraits, nostalgic atmosphere, scifi, digital painting, artstation, concept art, smooth, sharp foccus ilustration, artstation hq
+It's easy to explain 'cause this world's not tame
+owlish empress, D&D, fantasy, portrait, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+epic professional digital art of hungry eyes, eerie atmospheric lighting, painted, intricate, detailed, impressive, leesha hannigan, reyna rochin, wayne barlowe, mark ryden, duncan halleck, best on artstation, cgsociety, wlop, pixiv, stunning, gorgeous, much wow, hdr, 4 k, stunning, gorgeous, cinematic, masterpiece
+incredible, crossing a mindblowingly beautiful rainbow bridge, energy pulsing, matte painting, artstation, solarpunk metropolis, cgsociety, dramatic lighting, vibrant greenery, concept art, octane render, arnold 3 d render
+beautiful woman lying among snakes, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+artwork of a white tiger king with gold crown and blue king suit, concept art, portrait, super detailed, 4 k hd, trending on artstation, digital painted, low contrast, made by greg rutkowski and viktoria gavrilenko
+A Maine forest with cats roaming around beautiful lighting during golden hour. 50mm, f/1.8, Realistic details. Ultra HD. 8K V-ray. Octane Render. Unreal Engine 5. Professionally color graded. Concept art. Vibrant colors. fog. Bokeh
+a comic book poster of divali celebrations by moebius and makoto shinkai and rossdraws, featured on artstation, pixiv, volumetric lighting, 8 k, highly detailed render, soft glow, crisp lines, f 1 1, sharp focus,
+photo of a Dramatic Kathakali male character with traditional headgear painted face wearing futuristic robocop LED goggles and futuristic robot armour with wide traditional ghaghra in the style of stefan kostic, full body, realistic, sharp focus, symmetric, 8k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm, Hajime Sorayama, William-Adolphe Bouguereau
+vampire the masquerade, fame of thrones, lord, neon, fibonacci, sweat drops, insane, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+hyperdetailed portrait of a stunningly beautiful pink cyberpunk cute european girl made of metals and shiny iridescent gems, bright rainbow nimbus, gold necklace, smoke background inspired by ross tran and masamune shirow and kuvshinov, intricate, photorealistic, octane render, rtx, hdr, unreal engine, dnd digital art by artgerm
+3 / 4 view of a portrait of woman with flowy hair, bird wings, confident pose, pixie, genshin impact,, intricate, elegant, sharp focus, illustration, highly detailed, concept art, matte, trending on artstation, bright colors, art by wlop and artgerm and greg rutkowski, marvel comics h 6 4 0
+greg manchess portrait painting of a 2 yorha type a no. 2 as overwatch character!! holding a sword!!, white long hair, organic painting, sunny day, matte painting, bold shapes, hard edges, street art, trending on artstation, by huang guangjian and gil elvgren and sachin teng
+dungeons and dragons minotaur character closeup portrait, dramatic light, lake background, 2 0 0 mm focal length, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+the eldritch knight as a realistic fantasy knight, closeup portrait art by donato giancola and greg rutkowski, digital art, trending on artstation, symmetry!!
+epic portrait of snufkin, detailed, nebula skies, digital painting, artstation, concept art, donato giancola, joseph christian leyendecker, wlop, boris vallejo, breathtaking, high details, extremely detailed, sincere face, establishing shot, artistic, hyper realistic, beautiful face, octane render
+full body portrait of a korean schoolgirl with long hair and bangs, her hands are thin red tedrils, dramatic lighting, illustration by Greg rutkowski, yoji shinkawa, 4k, digital art, sci-fi horror concept art, trending on artstation
+symmetry!! young nicole kidman, machine parts embedded into face, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+a child looking at a portal in the hidden garden, scare, environment art, fantasy art, landscape art, in the style of greg rutkowski, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing
+nosferatu staying near body of dead woman, scary, dark, misty, at night, 8 k, detailed, concept art, trending on artstation
+polaroid picture, sepia, homeless jon hamm in the streets of los angeles, unshaved, toothless, next to a tent, symmetrical face, fine details, day setting, ethereal, trending on artstation
+anime elvis presley, rockabilly anime illustration, rock'n'roll cartoon, professional drawing, trending on pixiv
+a cute little girl with a round cherubic face, blue eyes, and short wavy light brown hair smiles as she floats in space with stars all around her. she is an astronaut, wearing a space suit. beautiful painting with highly detailed face by artgerm and quentin blake
+Tom Cruise at the king in the desert, beautiful face, fighting in a dark scene, eyes, detailed scene, standing in a heroic figure, Armour and Crown, highly detailed, blood and dust in the air, action scene, cinematic lighting, dramatic lighting, trending on artstation, elegant, intricate, character design, motion and action and tragedy, fantasy, D&D, highly detailed, digital painting, concept art
+portrait of a jamaican fisherman sci - fi glowing fishing armor muscular cyberpunk intricate elegant highly detailed digital painting artstation concept art, ocean background, jamaican colors, greg rutkowski, loish, rhads, ferdinand knab, makoto shinkai and lois van baarle, ilya kuvshinov, rossdraws, tom bagshaw
+an ugly donkey with eyelashes, fantasy art, in the style of artgerm, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, vibrant
+pregnant woman under street light, highly detailed, sharp focused, ultra realistic digital concept art by artgerm
+baroque oil painting full body portrait character concept art, anime key visual of young female black nazi military uniform maid, long flowing platinum blonde hair blue eyes, finely detailed symmetrical perfect face studio lit delicate features directed gaze, gapmoe kuudere grimdark, trending on pixiv fanbox, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli
+a award winning half body portrait of a beautiful woman in a croptop and cargo pants with ombre purple pink teal hairstyle with head in motion and hair flying listenin to music on headphones by wlop, paint splatter, outrun, vaporware, shaded flat illustration, digital art, trending on artstation, highly detailed, fine detail, intricate
+draco malfoy, clash royal style characters, unreal engine 5, octane render, detailed, brawl stars, cinematografic, cinema 4 d, artstation trending, high definition, very detailed
+some kittens playing around in a room with yellow background color filled with a fridge. animal cat. digital art. artstation. realistic. vibrant. illustration. in the style of pixar movie. octane render. art by artgerm and greg rutkowski and alphonse mucha. volumetric lighting.
+a pretty smiling blonde girl with heart - shaped sunglasses dressed in pink shiny clothes is walking over water, sun set and skyscrappers in the background, art by guweiz, dramatic lighting, highly detailed, incredible quality, trending on artstation
+cinematic portrait, captin falcon from smash bros, from left, head and chest only, desaturated, tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, oil on canvas, masterpiece, trending on artstation, featured on pixiv, cinematic composition, dramatic pose, beautiful lighting, sharp, details, hyper - detailed, hd, 4 k
+cypher dark souls blood borne fashion photograph, portrait close up, glowing epcot, rei ayanami, final fantasy marlboro, reptile eye of providence, alien brainsucker by karol bak, zdzisław beksinski, daft punk mf boom helmet, kodak portra 4 0 0, 8 k, highly detailed, britt marling style 3 / 4 photographic close, illuminati pyramid, female anime character, druid wizard, giygas organic being, portrait, skeleton, kannon mindar android, sparking beeple, from artstation, anime render, rutkowski of symmetrical art, android wlop, station, very coherent punk, glitchcore, iridescent on greg cyber the cinematic, art, artwork. cinematic, 8 k, unreal albedo accents, art, high hyper epcot, inside realism, hyper wizard very male octane broken hellscape, of mindar detail, greg overlord, artwork, rutkowski colossus, symmetrical key detail, coherent trending japan, artwork, space hornwort, artwork. abstract, druid druid, artstation, futurescape, on render, shadows robot, glitch forest organic, character, spell, render, key octane render, accents a concept library casting iridescent abstract. by octane intricate realism, octane dan from intricate mask, trending intricate intricate high render, art, gems, mumford. wu, tooth engine cannon beeple, 8 k, a oni
+beautiful black girl magic, nature goddess with brown skin in front of nebulae bursting halos, crisp digital painting by artgerm by mucha by caravaggio and face by wlop
+goth anime clown in mini skirt and crop top intricate, extremely detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, intimidating lighting, incredible art, face and body
+Twin Peaks, of Michael Shannon the mechanic discovering a man dressed as a Furry in the woods, mysterious creepy, poster artwork by Michael Whelan, Bob Larkin and Tomer Hanuka, from scene from Twin Peaks, simple illustration, domestic, nostalgic, from scene from Twin Peaks, clean, full of details, by Makoto Shinkai and thomas kinkade, Matte painting, trending on artstation and unreal engine, super clean, fine detail, cell shaded,
+realistic character concept, japanese queen with lots of jewelry in the face, elegant pose, scifi, illustration, symmetrical, artstation, cinematic lighting, hyperdetailed, cgsociety, 8 k, high resolution, charlie bowater, tom bagshaw, single face, insanely detailed and intricate, beautiful, elegant, golden ratio, dark fractal background, vfx, postprocessing, soft lighting colors scheme, fine art photography, hyper realistic, photo realistic
+magic : the gathering fantasy character concept art of a ball of rice with a menacing facial expression, by frank frazetta and marco bucci, high resolution. dark fantasy forest in the background, fantasy coloring, intricate, digital painting, artstation, smooth, sharp focus
+pregnant woman under street light, highly detailed, sharp focused, ultra realistic digital concept art by Alyssa Monks, Ruan Jia, Stanley Artgerm
+a grim dark fantasy town seen from the gutters, dnd encounter, dark fantasy, rain, atmospheric lighting, extremely detailed, no people, photorealistic, octane render, 8 k, unreal engine 5. art by artgerm and greg rutkowski and alphonse mucha
+mf doom with reptile eyes, fallout power armor exploding into fractals, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, frank frazetta
+Very very very very highly detailed epic central composition portrait of face with venetian mask, golden, intricate, dystopian, sci-fi, extremely detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, intimidating lighting, incredible art by Tokujin Yoshioka and Anton Pieck
+Michael Fassbender in white armor, intricate, epic lighting, hyper realistic, white short hair, character concept art, cinematic, artgerm, artstation trending.
+a hyper - realistic character concept art portrait of a computer man, depth of field background, artstation, award - winning realistic sci - fi concept art by jim burns and greg rutkowski, beksinski, a realism masterpiece, flesh - tone color palette, james gilleard, bruegel, alphonse mucha, and yoshitaka amano.
+tundra, digital art, concept art, magic fantasy, vibrant colors, high contrast, highly detailed, trending on artstation, 8k, andreas rocha, sylvain sarrailh, darek zabrocki, finnian macmanus, dylan cole, liang mark, albert bierstadt, sung choi, peter mohrbacher, greg rutkowski, studio ghibli
+beautiful full body Emma Watson smiling, art by lois van baarle and loish and ross tran and rossdraws and sam yang and samdoesarts and artgerm, digital art, highly detailed, intricate, sharp focus, Trending on Artstation HQ, deviantart, unreal engine 5, 4K UHD image
+a stunning GTA V loading screen with a beautiful woman with ombre hairstyle in purple and pink blowing in the wind, city streets, golden ratio, digital art, trending on artstation
+A cyberpunk cyborg girl with big and cute eyes, fine-face, realistic shaded perfect face, fine details. not anime. Realistic shaded lighting poster by Ilya Kuvshinov katsuhiro, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash, Rob Rey and Kentarõ Miura style, trending on art station
+peaceful elven forest, thick forest filled with elven warriors, by alan lee, michal karcz, smooth details, lord of the rings, game of thrones, smooth, detailed terrain, oil painting, trending artstation, concept art, fantasy matte painting
+a lisa frank fashion model mcdonalds princess microwaved super deluxe big mac happymeal with diet coke and a large order of fries, gothic, highly detailed, digital painting, artstation, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+collie as odin, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, illustration, hearthstone, art by artgerm and greg rutkowski and alphonse mucha, simon stalenhag, hyperreal
+a beautiful daft punk humanoids with freckled cheeks, cyber neon lighting, futurism, intricate futuristic jewelry accessories, cyberpunk glossy white latex swimsuit, profile posing, hyper photorealistic, crispy quality, digital photography, trending in artstation, trending in pinterest, cinematic, 4 k ultra hd, art by pascal blanche, art by greg rutkowski,
+portrait sci-fi art by Ruan Jia and Raymon Swanland, a glowing alien neon glass orb floating above the hand of a soldier, solar flares, detailed and intricate futuristic environment, cyberpunk, neon color bioluminescence, transparent reflective metal, dramatic lighting, cinematic, high technology, highly detailed portrait, digital painting, artstation, concept art, smooth, sharp focus, illustration, Artstation HQ
+Rose Gold intricate lace smoke portrait, geometric watercolor art by peter mohrbacher and artgerm, radiant halo of light
+skinny male fantasy alchemist, long dark hair, 1 9 th century, elegant, highly detailed, intricate, smooth, sharp focus, artstation, digital paining, concept art, art by donato giancola, greg rutkowski, artgerm, cedric peyravernay, valentina remenar, craig mullins
+cute friendly shrine maiden by charlie bowater and titian and artgerm, intricate, face, japanese shrine, elegant, pink mist, beautiful, highly detailed, dramatic lighting, sharp focus, trending on artstation, artstationhd, artstationhq, unreal engine, 4 k, 8 k
+cute fisherman tom daley, natural lighting, path traced, highly detailed, high quality, digital painting, by don bluth and ross tran and studio ghibli and alphonse mucha, artgerm
+Boris Johnson as Jack Sparrow, Boris Johnson hairstyle, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+billionaire's yacht adopted as a vacation spot for coal miners a Mandelbrot fractal by Craig Mullins, ilya kuvshinov, krenz cushart, artgerm trending on artstation by Edward Hopper and Dan Mumford and WLOP and Rutkovsky, Unreal Engine 5, Lumen, Nanite
+hisoka, young tom hiddleston, cel - shaded animesque art by artgerm and greg rutkowski and alphonse mucha, smooth white skin, smirking face, reddish hair, d & d, fantasy, feminine portrait, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration
+The eye of cthulu from Terraria, 3d render trending on artstation
+photographic portrait of a widow, highly detailed, digital painting, Trending on artstation , HD quality, by artgerm and greg rutkowski and alphonse mucha, dramatic light, octane
+portrait of megan fox as pinhead, bald, hellraiser, hell, intricate, headshot, highly detailed, digital painting, artstation, concept art, sharp focus, cinematic lighting, illustration, art by artgerm and greg rutkowski, alphonse mucha, cgsociety
+lady assassin wearing cyberpunk streetwear, cybernetic legs, detailed portrait, 4 k, vivid colours, concept art by wlop, ilya kuvshinov, artgerm, krenz cushart, greg rutkowski, pixiv. cinematic dramatic atmosphere, sharp focus, volumetric lighting, cinematic lighting, studio quality
+a monk meditating, in the style of tomasz alen kopera and fenghua zhong and peter mohrbacher, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, trending on artstation
+goddess of war, accurate anatomy, IFBB fitness body, only two hands, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by art by artgerm and greg rutkowski and edgar maxence
+a portrait of a beautiful cybernetic woman meditating in lotus pose, wires, cyberpunk concept art by josan gonzales and philippe druillet and dan mumford and enki bilal and jean claude meziere
+symmetry!! portrait of mark zuckerberg, hairless!!, fantasy, medieval wear, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+professional concept art portrait of a masked diesel punk man in a dark room by artgerm and greg rutkowski ( thin white border ). an intricate, elegant, highly detailed digital painting, concept art, smooth, sharp focus, illustration, in the style of cam sykes, wayne barlowe, igor kieryluk.
+margot robbie, manga cover art, detailed color portrait, artstation trending, 8 k, greg rutkowski
+a portrait of an anthropomorphic cyberpunk mouse holding a can of beer, cyberpunk!, fantasy, elegant, digital painting, artstation, concept art, matte, sharp focus, illustration, art by josan gonzalez
+skeleton man walking forward with explosion behind him, science fiction industrial hard science concept art, 8K render octane high definition cgsociety, photorealistic, unreal engine
+a cloaked adventure standing in a winding road, gas street lamps. Country road, country landscape, fields, fields, the ruins of one small barn, wide view, desolate. digital illustration, very vibrant colors, soft lighting, adventurous, atmospheric lighting, 8K, octane render. By Makoto Shinkai, Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti, krenz cushart, Sakimichan, D&D trending on ArtStation, digital art.
+vibrant colorful vaporwave geometry symmetry bauhaus poster, etching by gustave dore, intricate, sharp focus, illustration, highly detailed, digital painting, concept art, masterpiece
+Abandoned medieval castle, art by Quentin Mabille , trending on artstation, artstationHD, artstationHQ, 4k, 8k
+Boris Johnson as Wolverine, portrait, X man costume, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+Mikasa Ackerman, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by greg rutkowski and alphonse mucha
+lateral portrait of samurai, sci - fi, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Aly Michalka as a stunning , beautiful retro SCI-FI space heroine 1985 , movie poster, intricate, elegant, highly detailed, centered, digital painting, trending on artstation, concept art, smooth, sharp focus, illustration, art by raphael lacoste ,eddie mendoza ,alex ross, WLOP
+a visual representation of a place evoked by the song titles of the album kryptos by andreas vollenweider, photorealistic and intricate concept art, 8 k hdr, cinematic lighting
+fantasy girl mage in a forest, dramatic fantasy art, by yoshitaka amano, trending on artstation, 4 k, expressive oil painting, close - up face portrait, vivid colors
+a portrait of a finely detailed beautiful!!! feminine cyberpunk ghost rider with skull face and long flowing hair made of fire and flames, dressed in black leather, by Alphonse Mucha, designed by H.R. Giger, legendary masterpiece, stunning!, saturated colors, black background, trending on ArtStation
+tattoo design, stencil, stencil on paper, tattoo stencil, traditional, beautiful portrait of a traditional Japanese girl with flowers in her hair, upper body, by artgerm, artgerm, artgerm, digital art, cat girl, anime eyes, anime, sexy, super model-s 100
+portrait of a young very beautiful cute tribal woman with a steampunk gun, in a post apocalyptic city overgrown with lush vegetation, by Luis Royo, by Greg Rutkowski, dark, gritty, intricate, head space, volumetric lighting, volumetric atmosphere, concept art, cover illustration, octane render, trending on artstation, 8k
+a young attractive Asian woman in the pilot's seat of a massive sci-fi mecha, dramatic pose, LEDs, highly detailed, photorealistic, volumetric lighting, digital art, octane render, in the style of Artgerm and Tom Bagshaw
+wolf warrior in red cape and hood, d & d, fantasy, portrait, highly detailed, headshot, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+a beautiful young charming asian goddess with sundress and jewelry | | winter, realistic shaded, unpleasant face, good looking, fine details, dior, lv, realistic shaded lighting poster by greg rutkowski, macoto takahashi, magali villeneuve, artgerm, jeremy lipkin and michael garmash
+hyperrealistic portrait of a woman monster astronaut, full body portrait, well lit, intricate abstract. cyberpunk, intricate artwork, by Tooth Wu, wlop, beeple. octane render,in the style of Jin Kagetsu, James Jean and wlop, highly detailed, sharp focus, intricate concept art, digital painting, ambient lighting, 4k, artstation
+tracer overwatch portrait, close up, concept art, intricate details, highly detailed photorealistic portrait by michael komarck, joel torres, seseon yoon, artgerm and warren louw
+a grim reaper with a crt monitor for a head. the monitor has a blue screen with white letters on it. by frank frazetta, simon bisley, brom, concept art, octane render, unreal engine 5, highly detailed, high quality, 8 k, soft lighting, realistic face, path traced
+blender gloomy colossal ruined server room in datacenter robot figure automata headless drone robot knight welder posing pacing fixing soldering mono sharp focus, emitting diodes, smoke, artillery, sparks, racks, system unit, motherboard, by pascal blanche rutkowski artstation hyperrealism cinematic dramatic painting concept art of detailed character design matte painting
+a photograph of a robot endoskeleton submerged and rusted in the water, cinematic, volumetric lighting, f 8 aperture, cinematic eastman 5 3 8 4 film, photorealistic by greg rutkowski, by stanley artgerm, by alphonse mucha
+hyper detailed ultra sharp, trending on artstation, vibrant aesthetic, bloodwave, colorful, psychedelic, ornate, intricate, digital painting, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and h. r. giger, 8 k
+gothic bell tower, view from above. in style of greg rutkowski, jesper ejsing, makoto shinkai, trending on artstation, fantasy, great composition, concept art, highly detailed, scenery, 8 k, behance.
+ned kelly, extremely detailed, artstation, 8 k, sensual lighting, incredible art, wlop, artgerm
+a girl in times square new york, very sexy outfit, very anime, medium shot, visible face, detailed face, perfectly shaded, atmospheric lighting, by makoto shinkai, stanley artgerm lau, wlop, rossdraws
+full-body baroque and cyberpunk glass sculpture of attractive muscular iridescent Nick Jonas as a humanoid deity wearing a thin see-through plastic hooded cloak sim roupa, posing like a superhero, glowing pink face, crown of white lasers, large diamonds, swirling black silk fabric. futuristic elements. oozing glowing liquid, full-length view. space robots. human skulls. throne made of bones, intricate artwork by caravaggio. Trending on artstation, octane render, cinematic lighting from the right, hyper realism, octane render, 8k, depth of field, 3D
+breathtaking detailed soft painting of silver hours of sun, caresses on pepper plains, the hand of the country on my shoulder, rembrandt style, elegant, highly detailed, artstation, concept art, matte, sharp focus, art by tom bagshaw, and greg rutkowski
+Emma Watson as a dune princess, sci-fi, amber eyes, face, long hair, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+game of thrones, masterpiece, pinup, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K
+one beautiful symmetrical close up head shoulder face portrait android woman time machine axonometric mechanical fantasy intricate elegant highly detailed in volumetric void of latent space, golden turquoise steampunk, axonometric high contrast cinematic light, mystical shadows, digital painting, smooth, sharp focus, divine realm of gods, octane render, photographic, concept art, artist leonardo davinci, unreal engine 8 k
+arnold schwarzenegger surfing inside erupting volcano, stunning scene, 8 k, extremely detailed digital painting, depth, bright colors, trending on artstation
+a photorealistic dramatic fantasy render of a beautiful woman billie eilish wearing a beautiful intricately detailed japanese monkey kitsune mask and clasical japanese kimono by wlop, artgerm, greg rutkowski, alphonse mucha, epic, beautiful dynamic dramatic dark moody lighting, shadows, cinematic atmosphere, artstation, concept design art, octane render, 8 k
+Portrait of a space astronaut monkey, fantasy, intricate, highly detailed, digital painting, trending on artstation, sharp focus, illustration, style of Stanley Artgerm
+goddess of death, braids, decaying face, neon hair, intricate illuminated jewellery, digital painting, surrealism, extreme detail, cinematic lighting, trending on artstation, by hans zatzka
+a zombie teenager staring at their phone, tristan eaton, victo ngai, artgerm, rhads, ross draws
+realistic detailed face portrait of a rugged male wizard with black hair wearing a hooded cloak by alphonse mucha, ayami kojima, amano, greg hildebrandt, and mark brooks, male, masculine, art nouveau, neo - gothic, gothic, character concept design
+a shadowy figure in tattered robes sees another figure in the distance, in an alien desert during a sandstorm ; tension, creepy mood, uneasy atmosphere, weird fiction art, breathtaking digital illustration, cinematic lighting, striking perspective, aesthetic composition, trending on artstation
+an epic painting minion looking like elon musk presenting new tesla, pencil drawing, perfect composition, golden ratio, beautiful detailed, photorealistic, digital painting, concept art, smooth, sharp focus, illustration, artstation trending, octane render, unreal engine
+Hedgehog magus, Tzeentch, portrait, nature, fairy, forest background, magic the gathering artwork, D&D, fantasy, cinematic lighting, centered, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, volumetric lighting, epic Composition, 8k, art by Akihiko Yoshida and Greg Rutkowski and Craig Mullins, oil painting, cgsociety
+mermaid emma watson, perfectly-centered-painting of emma watson, sweaty, dynamic action pose, insane, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+painting of hybrid between cat & dragon & snake & fox, intercrossed animal, by zdzislaw beksinski, by lewis jones, by mattias adolfsson, cold hue's, warm tone gradient background, concept art, beautiful composition, digital painting
+character portrait of a raven angel of night with iridescent black raven wings wearing robes, lord of change, by peter mohrbacher, mark brooks, jim burns, marina abramovic, wadim kashin, greg rutkowski, trending on artstation
+girl sitting on a stair under a vine rack, many green plant and flower gowing on it, illustration concept art anime key visual trending pixiv fanbox by wlop and greg rutkowski and makoto shinkai and studio ghibli
+giant skeletal ghoul devouring a mountain of skulls, digital painting, mixed media, trending on artstation and deviantart, epic composition, highly detailed, 8 k
+portrait of jean baudrillard, soft hair, muscular, half body, leather, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+hacker girl sits at an apple ] [ e, realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+movie still macro close photo of koala selling nft, by weta disney pixar greg rutkowski wlop ilya kuvshinov rossdraws artgerm octane render iridescent, bright morning, liosh, mucha
+a coffee shop store in The City of Ukraine at night with a few customers, extreme plus resolution fantasy concept art, intricate details to everything visible, sharp lighting, Dramatic light by denis villeneuve, strong emphasis on alphonse mucha, Makoto Shinkai
+the interior of a store that sells board games and sushi, intricate, digital painting, masterpiece, rending on artstation, octane render, art by artgerm and greg rutkowski and alphonse mucha and craig mullins and James Jean and Andrei Riabovitchev and Marc Simonetti and peter mohrbacher
+danny devito as wolverine, oil on canvas portrait, octane render, trending on artstation
+portrait painting of male evil demonic cult member, agony, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+a snoop dogg wearing sun glasses tennis ball monster, snoop dogg tennis ball head, smoking, smoke, monster teeth, colorful, chalk digital art, fantasy, magic, chalk, trending on artstation, ultra detailed, professional illustration by basil gogos
+dusk land dark city filled with shadow people, desolate, gloomy, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski
+portrait of a beautiful woman wearing a sari dress, holding a bouquet of flowing flowers, drenched body, wet dripping hair, emerging from the water, fantasy, regal, fractal crystal, fractal gems, by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman rockwell
+astronaut drifting in space, artwork by greg rutkowski
+book cover!!!!!!!!!!!!, old bridge, fantasy forest landscape, fantasy magic, light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by wlop and artgerm and ivan shishkin and andrey shishkin, masterpiece
+a beautiful hyperrealistic detailed 3D render of a burning monument, by Anton Otto Fischer, Atey Ghailan, genzoman, unreal engine, octane render, gigantic, 3D, brilliantly coloured, intricate, ultra wide angle, trending on artstation, embers, smoke, dust, dusk, volumetric lighting, HDR, polished, micro details, ray tracing, 8k
+close-up macro portrait of the face of a beautiful princess with ram skull mask, epic angle and pose, symmetrical artwork, 3d with depth of field, blurred background, cybernetic jellyfish female face skull phoenix bird, translucent, nautilus, energy flows of water and fire. a highly detailed epic cinematic concept art CG render. made in Maya, Blender and Photoshop, octane render, excellent composition, cinematic dystopian brutalist atmosphere, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse. y Greg Rutkowski, Ilya Kuvshinov, WLOP, Stanley Artgerm Lau, Ruan Jia and Fenghua Zhong
+a super realistic dragon that is on fire standing dramatically on a destroyed city, ultrawide shot, surreal, sharp focus, digital art, epic composition, concept art, dynamic lighting, intricate, highly detailed, 8 k, unreal engine, blender render
+man in suit launching the nukes, matte painting concept art, baroque, beautifully backlit, swirly vibrant color lines, fantastically gaudy, aesthetic octane render, 8 k hd resolution, by caravaggio and diego velazquez
+an extremely psychedelic portrait of SalvadorDali, by Raphael Hopper, and Rene Magritte. Extremely Highly detailed, Occult, funny, humorous, humor, hilarious, funny, entertaining, magical, trending on artstationHQ, LSD, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, artstation, illustration, concept art, smooth, sharp focus, art by John Collier and Albert Aublet and Krenz Cushart and Artem Demura and Alphonse Mucha and Giuseppe Arcimboldo
+inside an etheral atompunk city, highly detailed, 4k, HDR, award-winning, octane render, trending on artstation, volumetric lighting
+subspace emissary, jungle groove, constellation - based cathedral, octane render, trending on artstation, ray - tracing, subsurface scattering, 4 k, high quality desktop wallpaper
+a dream of being trapped underwater, thalassophobia, fear of the ocean, open water, imagination, dream, concept art, trending on artstation, highly detailed
+an anime portait shogun knight with a lightsaber halberd, dark metal armor, and a tattered cape, by stanley artgerm lau, wlop, rossdraws, james jean, andrei riabovitchev, marc simonetti, and sakimichan, trending on artstation
+postmodern zakopane designed by louis sullivan, still from a movie, photo art, artgerm, trending on artstation
+a beautiful action portrait of a handsome DnD-ranger hunting in a forest, face is brightly lit, by Greg Rutkowski and Raymond Swanland, Trending on Artstation, ultra realistic digital art
+jim carrey, portrait shinkai makoto studio ghibli studio key hideaki anno sakimichan stanley artgerm lau rossdraws james jean marc simonetti elegant highly detailed digital painting artstation pixiv
+a man tied to a pillar by jack russel terrier, highly detailed, hyperrealistic digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+a professional painting of an russian young blonde girl intricate, wearing russian ancient folk dress, elegant, digital painting, concept art, smooth, sharp focus, finely detailed illustration, beautifully framed, from Metal Gear, in the style of Artgerm and Greg Rutkowski and William-Adolphe Bouguerea
+soaring woman wearing a round mask hiding her face with many thick long blades behind head. dressed in a long robe with wide sleeves and making anjali mudra gesture. highly detailed, symmetric, concept art, saturated colors, masterpiece, fantasy art, hyperdetailed, hyperrealism, art by zdzisław beksinski, arthur rackham, dariusz zawadzki, larry elmore
+ancient queen billie eilish, symetrical, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, artstation, illustration, concept art, 4 k, smooth, sharp focus, art by john collier and albert aublet and krenz cushart and artem demura and alphonse mucha
+a gorgeous kanye west photo, professionally retouched, soft lighting, realistic, smooth face, full body shot, torso, perfect eyes, wide angle, sharp focus on eyes, 8 k high definition, insanely detailed, intricate, elegant, art by artgerm and jason chan and mark litvokin
+a bear and a bunny chimera with the size and strength of a bear, The white color and long bunny ears of a bunny and golden brown antlers. Concept art. Fantasy. Trending on artstation. Masterpiece. By Karlkka. By Greg Rutkowski James Gurney
+beautiful anime girl with short white hair, wearing lab coat and glasses, holding a clipboard, standing inside a research facility, character portrait, 1 9 6 0 s, long hair, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, charlie bowater and alexandra fomina
+Manga cover portrait of an extremely cute and adorable beautiful curious happy puppy smelling a flower, summer vibrance, 3d render diorama by Hayao Miyazaki, official Studio Ghibli still, color graflex macro photograph, Pixiv, DAZ Studio 3D
+concept art of fried egg, highly detailed painting by dustin nguyen, akihiko yoshida, greg tocchini, greg rutkowski, cliff chiang, 4 k resolution, trending on artstation, 8 k
+Bob Dylan design, character sheet, Kim Jung Gi, Greg Rutkowski, Zabrocki, Karlkka, Jayison Devadas, Phuoc Quan, trending on Artstation, 8K, ultra wide angle, zenith view, pincushion lens effect
+dichroic ant axolotl snail bug bee fly worm caterpillar fish, (((artstation, concept art, smooth, sharp focus, artgerm, Tomasz Alen Kopera, Peter Mohrbacher, donato giancola, Joseph Christian Leyendecker, WLOP, Boris Vallejo))), octane render, unreal engine, 3d render, , octane render, nvidia raytracing demo, grainy, muted
+sojourn from overwatch, african canadian, gray hair, character portrait, portrait, close up, concept art, intricate details, highly detailed, vintage sci - fi poster, retro future, vintage sci - fi art, in the style of chris foss, rodger dean, moebius, michael whelan, and gustave dore
+open treasure chest with the greatest riches on earth, deep focus, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone, art by artgerm and greg rutkowski and alphonse mucha
+hunter woman walking across foggy river, unreal engine 5, art by artgerm and greg rutkowski and alphonse mucha, global illumination, detailed and intricate environment, hyperrealistic, volumetric lighting, epic cinematic shot, perfectly defined features, ambient occlusion
+psychedelic ; trippy ; acid trip ; artgerm ; salvadore dali ; surreal ; abstract ; lsd ; jesus christ ; ascension ; symmetrical ; mathematical
+girl floating on the night sky, gaint planet in the background, illustration concept art anime key visual trending pixiv fanbox by wlop and greg rutkowski and makoto shinkai and studio ghibli
+a strange alien fruit, photorealistic, 8 k, professional food photography, volumetric lighting, trending on artstation
+painting of hybrid between bear & snake, animal has snake body, intercrossed animal, by zdzislaw beksinski, by lewis jones, by mattias adolfsson, cold hue's, warm tone gradient background, concept art, beautiful composition, digital painting
+a professional photographic view picture of a alley in space, photographic filter unreal engine 5 realistic hyperdetailed 8 k ultradetail cinematic concept art volumetric lighting, very beautiful scenery, very realistic effect, hd, hdr, cinematic 4 k wallpaper, 8 k, sharp focus, octane render, ultra detailed, high resolution, artstation trending on artstation in the style of albert dros glowing rich colors powerful imagery
+mulan, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone, art by artgerm and greg rutkowski and alphonse mucha
+'' Portrait of Beautiful blonde Slavic woman in her early 30's, league of legends, LOL, fantasy, d&d, digital painting, artstation, concept art, sharp focus, illustration, art by greg rutkowski and alphonse mucha ''
+beautiful cottagecore kim kardashian holding a adidas yeezy shoe. intricate, elegant. highly detailed, digital painting, artstation, concept art, smooth, sharp, focus, illustration. . art by artgerm and greg rutkowski and alphonse mucha
+symmetry, multiple humans in solid silhouettes, saluting, dancing, interacting and posing, mooc, organic and intricate, elegant, highly detailed, concept art, sharp focus, illustration, high contrast, long shadows, painted with colour on white, 8 k
+cinematic bust portrait of psychedelic cyborg, head and chest only, exotic alien features, Tim Hildebrandt, Wayne Barlowe, Bruce Pennington, donato giancola, larry elmore, oil on canvas, masterpiece, trending on artstation, featured on pixiv, cinematic composition, dramatic pose, beautiful lighting, sharp, details, hyper-detailed, HD, HDR, 4K, 8K
+a highly detailed illustration of cute smug pink haired pale girl with curved horns wearing oversized pink hoodie, dramatic smirk pose, intricate, elegant, highly detailed, centered, soft light, character design, cushart krenz, digital painting, artstation, concept art, smooth, sharp focus, league of legends concept art, wlop.
+portrait of a young mila kunis in front of a cyberpunk city, dramatic light, city background, sunset, high contrast, sharp, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+portrait close up of guy, concentrated look, symmetry, long hair. d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, art by artgerm and greg rutkowski and alphonse mucha, boris vallejo
+ law contrasts, fantasy concept art by Jakub Rozalski, Jan Matejko, and J.Dickenson
+professional concept art ethereal ghostlike valkyrie figure fluid simulation in houdini dancing in dark smoke robes and silk veils by ilm, paolo roversi, nick knight, amy judd, beautiful simplified form in turbulent movement, dark studio background, turner, romantic, trending on artstation, hyperrealism, matte painting, dutch golden age, fine detail, cgsociety
+portrait Anime batman cosplay girl cute-fine-face, pretty face, realistic shaded Perfect face, fine details. Anime. realistic shaded lighting by katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, rutkowski Jeremy Lipkin and Giuseppe Dangelico Pino and Michael Garmash and Rob Rey
+hyperrealism, detailed textures, photorealistic 3 d, a young boy walking down the street holding a worn out teddy bear, ultra realistic, cinematic, intricate, cinematic light, concept art, illustration, art station, unreal engine 8 k
+nuclear power plant, colorful, sci-fi, clean, utopia, surrounded by wilderness, sunset, octane render, substance painter, zbrush, trending on artstation, 8K, highly detailed.
+Defect from Slay the Spire, concept art, by Odilon Redon
+insanely detailed procedural render expressive scene of chrome spacesuits protecting the dancing nudibranch girl from certain doom as the planet they orbit sends spores attack them, photorealism, sharp focus, award winning, tristan eaton, victo ngai,, maxfield parrish, artgerm, koons, ryden, intricate details, 3 / 4 view, bokeh
+portrait art of Gene Kelly 8k ultra realistic , lens flare, atmosphere, glow, detailed,intricate, full of colour, cinematic lighting, trending on artstation, 4k, hyperrealistic, focused, extreme details,unreal engine 5, cinematic, masterpiece
+portrait painting of a post - apocalyptic bald androgynous teenager with white eyes and a green aura around his head, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+ancient neon monster portrait, intricate artwork by josan gonzalez, artgerm, h. r. giger, kilian eng, very coherent artwork, cinematic, hyper realism, vibrant, octane render, unreal engine, 8 k, high contrast, higly detailed black ink outline
+twin peaks poster art, portrait of the black lodge has the blue colored rose trapped in a glass box, can david bowie find it, by michael whelan, rossetti bouguereau, artgerm, retro, nostalgic, old fashioned
+the beautiful hyper detailed scene render that a beautiful girl lies in the arms of a huge silver dragon alone in the fairyland surrounded by white clouds, in the style of makoto shinkai victo ngai and peter mohrbacher studio ghibli artgerm karol bak beeple, animation style, 8 k hd, dream, ultra wide angle, animation style, 3 drender, hyperdetailed
+portrait of a beautiful young fit male angel with curly blond hairs, dressed with fluent clothes, luminous scene, by Greg Rutkowski and alphonse mucha, d&d character, gradient white to cyan, in front of an iridescent background, highly detailed portrait,
+a scene of a camper in the desert, a cowboy in the foreground looking epic, full shot, atmospheric lighting, detailed faces, by makoto shinkai, stanley artgerm lau, wlop, rossdraws
+lalisa manoban of blackpink, knight armor, tarot card, highly detailed, digital painting, smooth, sharp focus, illustration, ultra realistic, 8 k, art by artgerm and alphonse mucha
+feudal japan tokyo street at dusk, raining, detailed reflections, on a postcard, cinematic lighting!!, 4k, trending on artstation, detailed watercolour, rule of thirds, center focus, art by albert bierstadt
+concept art by greg rutkowski, a gigantic spear - shaped starship approaches the system, huge and megalithic, plowing through space, frightening and creepy atmosphere, scifi, digital painting, artstation, concept art, smooth, sharp foccus ilustration, artstation hq
+beautiful girl a strange wind blew in off the north sea, an eerie susurration that cut across the eastern sea, beautiful portrait, symmetrical, character concept style trending on artstation concept art detailed octane render cinematic photo - realistic 8 k high detailed
+the street of a frozen village in ice that never the see the sun again, concept art by makoto shinkai and greg rutkowski, matte painting, trending on artstation
+full length photo of a gorgeous young woman in the style of stefan kostic, realistic, sharp focus, 8k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm
+beautiful sci fi space scene with planets, concept art trending on artstation, volumetric lighting, 8k
+brigitte from overwatch, character portrait, portrait, close up, concept art, intricate details, highly detailed, vintage sci - fi poster, retro future, vintage sci - fi art, in the style of chris foss, rodger dean, moebius, michael whelan, and gustave dore
+will smith fights against demons dressed as a gladiator and with angel wings, cinematic lighting, highly detailed, concept art, art by wlop and artgerm and greg rutkowski, masterpiece, trending on artstation, 8 k
+Till Lindemann crushing planet earth with his teeth. epic game portrait. Highly detailed, highly recommended. fantasy art by Greg Rutkowski
+Art nouveau Ferarri, fantasy, intricate galactic designs, elegant, highly detailed, sharp focus, art by Artgerm and Greg Rutkowski and WLOP
+walter white as lara croft, digital painting, extremely detailed, 4 k, intricate, brush strokes, mark arian, artgerm, bastien lecouffe - deharme
+a swamp viewed from afar with one huge tree in the middle, dark colors, glowing plants, misty background, light rays, sunset!, birds, beautiful lighting, vivid colors, intricate, elegant, smooth, sharp focus, highly detailed digital painting, concept art, cinematic, unreal engine, 4 k wallpaper, svetlin velinov, tarmo juhola, artstation trending
+wide angle, mage, sleeping on rock, white grey blue color palette, eyes closed, forest, female, d & d, fantasy, intricate, elegant, highly detailed, long brown hair, digital painting, artstation, octane render, concept art, matte, sharp focus, illustration, hearthstone, art by artgerm, alphonse mucha johannes voss
+cinematic portrait of the incredible hulk, only head and chest, intricate, desaturated, Tim Hildebrandt, Wayne Barlowe, Bruce Pennington, donato giancola, larry elmore, maxfield parrish, Moebius, Thomas Ehretsmann, oil on canvas, gouache painting, masterpiece, trending on artstation, cinematic composition, dramatic pose, volumetric lighting, sharp, details, hyper-detailed, HD, 4K, 8K
+cinematic bust portrait of futuristic robot from left, head and chest only, exotic alien features, robotic enhancements, desaturated, tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, oil on canvas, masterpiece, trending on artstation, featured on pixiv, cinematic composition, dramatic pose, beautiful lighting, sharp, details, hyper - detailed, hd, hdr, 4 k, 8 k
+perfectly detailed wisteria flowers!! blessed by nature with ever - increasing physical mental perfection, symmetrical! intricate, sensual features, highly detailed, biblical divine holy perfection!! digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+portrait of teenage girl with long glossy black hair, blue eyes, glowing skin, fashion model features, fantasy, intricate, elegant, black dress, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by Krenz Cushart and Artem Demura and alphonse mucha
+a cinematic detailed painting of a black kid in the woods, volumetric light, surrealism, highly detailed, realistic, retro, in the style of francis bacon and james jean, trending on artstation, painting by Edward Hoper, colorful, realistic, smooth, octane render
+concept art from zaha hadid, futuristic, ultra realistic, concept art, intricate details, highly detailed, photorealistic, octane render, 8 k
+hyperrealistic mixed media painting of a grungy skull woman with rainbow hair, stitched together, soft eyes and narrow chin, dainty figure, long hair straight down, torn v plunge shirt, short shorts, combat boots, basic white background, side boob, wet tshirt, wet, raining, dim volumetric lighting, 8 k octane beautifully detailed render, post - processing, portrait, extremely hyper - detailed, intricate, epic composition, cinematic lighting, masterpiece, trending on artstation, very very detailed, masterpiece, stunning,
+cat theme logo, cat theme banner, cat design, a smiling cat, art photography style, trending on artstation, warm light, lovely and cute, fantasy art, 8 k resolution
+cover concept art of the lost sand city, levitating sand, ground view, golden towers, golden pillars, palm trees, space and time, floating objects, post-processing, in the style of Hugh Ferriss, Behance, Artgerm. High detail, ultra realistic render, octane, 3D, photorealism, symmetric, cinematic
+male anime character, oni mask, organic, forest druid, dark souls boss, cyber punk, portrait, male anime character, robot, masterpiece, intricate, highly detailed, sharp, technological rings, by james mccarthy, by beeple and johfra bosschart, combination in the style ayami kojima, highly detailed, painting, 3 d render beeple, unreal engine render, intricate abstract, intricate artwork, by tooth wu, wlop, beeple, dan mumford. concept art, octane render, trending on artstation, greg rutkowski very coherent symmetrical artwork. cinematic, key art, hyper realism, high detail, octane render, 8 k, iridescent accents, albedo from overlord, the library of gems, intricate abstract. intricate artwork, by tooth wu, wlop, beeple, dan mumford. concept art, octane render, trending on artstation, greg rutkowski very coherent symmetrical artwork. cinematic, key art, hyper realism, high detail, octane render, 8 k, iridescent accents
+Lionel Messi closeup, D&D style, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+young shadow mage male, joyful, d & d, fantasy, intricate, elegant, full body, highly detailed, digital painting, artstation, concept art, matte, sharp, illustration, hearthstone, art by artgerm and greg rutkowski and alphonse mucha
+ultra realistic illustration, emma roberts from last of us, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+A beautiful cosmic entity || VERY ANIME, fine-face, realistic shaded perfect face, fine details. Anime. realistic shaded lighting poster by Ilya Kuvshinov katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash, Rob Rey and Kentarō Miura style, trending on art station
+muscular gandhi at the beach, sitting on the sand next to a campfire, with palm trees in the back, by artgerm, ilya kuvshinov katsuhiro villeneuve, jeremy lipkin and michael garmash and rob rey, disney pixar zootopia, by tristan eaton, stanley artgermm, tom bagshaw, greg rutkowski, carne griffiths
+A fancy portrait of an attractive humanoid creature by Greg Rutkowski, beeple, Sung Choi, Mitchell Mohrhauser, Maciej Kuciara, Johnson Ting, Maxim Verehin, Peter Konig, final fantasy, macro lens , 8k photorealistic, cinematic lighting, HD, high details, dramatic, dark atmosphere, trending on artstation
+headless horseman in a marvel movie, science fiction industrial hard science concept art, 8K render octane high definition cgsociety, photorealistic, unreal engine 5
+a highly detailed metahuman 4 k close up render of a goddess bella hadid monument renaissance in iris van herpen dress schiaparelli in diamonds crystals swarovski and jewelry iridescent in style of alphonse mucha gustav klimt trending on artstation made in unreal engine 4
+closeup portrait shot of a ring wraith in a scenic dystopian environment, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+Redhead Pleiadian alien human beautiful hybrid feminine woman, long gorgeous red hair in loose curls, with stunning green eyes, cute round face and a roundish nose, as a retro futuristic heroine, gorgeous digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and donato giancola and Joseph Christian Leyendecker, Ross Tran, WLOP
+gigachad luigi bodybuilder in a expensive dress suit by ilya kuvshinov, ernest khalimov body by krista sudmalis, fantasy character portrait, futuristic town background by laurie greasley, ultra realistic, concept art, intricate details, elegent, digital painting, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, artstation
+painting of a gorgeous young woman in the style of Martine Johanna, draped in flowing fabric, colorful energetic brush strokes, realistic, sharp focus, 8k high definition, insanely detailed, intricate, elegant, art by Martine Johanna and artgerm
+l lawliet, hunchback, death note, d & d, fantasy, portrait, highly detailed, headshot, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve and wlop
+a portrait of Chiefkeef in front of an Art Nouveau mandala wearing a huge elaborate detailed ornate crown made of all types of realistic colorful flowers, turban of flowers, sacred Geometry, Golden ratio, surrounded by scattered flowers peonies dahlias lotuses roses and tulips, photorealistic face, Cinematic lighting, rimlight, detailed digital painting, Portrait, headshot, in style of Alphonse Mucha, Artgerm, WLOP, Peter Mohrbacher, William adolphe Bouguereau, cgsociety, artstation, Rococo and baroque styles, symmetrical, hyper realistic, 8k image, 3D, supersharp, pearls and oyesters, turban of vibrant flowers, satin ribbons, pearls and chains, perfect symmetry, iridescent, High Definition, Octane render in Maya and Houdini, light, shadows, reflections, photorealistic, masterpiece, smooth gradients, no blur, sharp focus, photorealistic, insanely detailed and intricate, cinematic lighting, Octane render, epic scene, 8K
+percy jackson in cyberpunk city, 4 k, trending on artstation.
+wonderdream faeries lady feather wing digital art painting fantasy bloom vibrant style mullins craig and keane glen and apterus sabbas and guay rebecca and demizu posuka illustration character design concept colorful joy atmospheric lighting butterfly
+Boris Johnson as Thor with hammer Mjolnir, Boris Johnson hairstyle, full body realistic portrait, highly detailed, muscular body, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+An ancient Iranian fortress as Far Cry 4 concept art, spring season, beautiful, gorgeous buildings, , concept art by Viktor Vasnetsov, concept art, ancient era, warm lighting, soft by Ivan Shishkin, Dimitri Desiron and Antonio Lopez Garcia, hyperborea, high resolution, trending on artstation,
+high detailed white space station interior a statue jesus on cross made of red marble, perfect symmetrical body, full body shot, inflateble shapes, wires, tubes, veins, jellyfish, white biomechanical details, wearing epic bionic cyborg implants, masterpiece, intricate, biopunk, vogue, highly detailed, artstation, concept art, cyberpunk, octane render
+Award-Winning. Trending on Artstation. 8K. Corrupted Knight infected with black obsidian glowing red. Angular. Sharp. Ready for battle.
+2 0 year old ethiopian man, sitting on a black corvette, counting money, portrait, elegant, intricate, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by konstantin korovin and daniel f. gerhartz and john howe
+beautiful ethereal cyberpunk jennifer lawrence, art nouveau, fantasy, intricate binary and electronic designs, elegant, highly detailed, sharp focus, art by artgerm and greg rutkowski and wlop
+symmetry!! portrait of a horizon zero dawn machine acting as ironman, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+beautiful portrait of a minority female wearing fantastic costume,pigtail,intricate, elegant, highly detailed, dim volumetric lighting, 8k,octane,post-processing,digital painting, trending on artstation, concept art, smooth, sharp focus, illustration,by Tom Bagshaw and Daniel Gerhartz and Albert Aublet and Lawrence Alma-Tadema and alphonse mucha
+a female elf sorceress by karol bak and jia ruan, beautiful detailed eyes, cute, fantasy, intricate, elegant, highly detailed, digital painting, 4 k, hdr, concept art, detailed jewelry, smooth, sharp focus, illustration, art by artgerm
+high quality 3 d render very cute cyborg labrador!! dog plays drums!, cyberpunk highly detailed, unreal engine cinematic smooth, in the style of blade runner & pixar, hannah yata charlie immer, moody light, low angle, uhd 8 k, sharp focus
+concept art of an intelligent bear, bipedal, wearing glasses and a vest, holding a spellbook under his arm, anthromorphic, artstation, fantasy
+symmetrical - face!! portrait shot of evil sithlord captain kirk from star trek in star wars, realistic, professionally, professionally color graded, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+glorious full head portrait of abraham lincoln as Batman, fantasy, intricate, elegant, digital painting, trending on artstation, concept art, sharp focus, illustration by Gaston Bussiere and artgerm, 4k.
+a photorealistic 3 d seamless pattern of honey material with macro closeup details of circuits cables nvidia motherboard pcb futuristic robotic elements in glass and mirror in the style of zaha hadid, 3 d realistic model render in cyberpunk 2 0 7 7 colors, unreal engine 5, keyshot, octane, artstation trending, ultra high detail, ultra realistic, cinematic, 8 k, 1 6 k, large realistic elements in style of nanospace michael menzelincev, in style of lee souder, in plastic, dark atmosphere, tilt shift, depth of field
+skull - headed robot cyborg painting, illutstration, concept art, cyberpunk, futurism, comics art, artgerm
+hyper realistic portrait, beautifully rendered, luis guzman as luigi wearing green, smirking deviously, painted by greg rutkowski, wlop, artgerm, dishonored 2
+mahindra thar driving through madagascar with baobabs trees, artgerm and greg rutkowski and alphonse mucha, an epic fantasy, volumetric light, detailed, establishing shot, an epic fantasy, trending on art station, octane render, midsommar
+kurdish! assassins creed game set in kurdistan!, concept art, digital painting, highly detailed, 8 k, high definition
+portrait of ((mischievous)), baleful young Cate Blanchett as young Galadriel as a queen of fairies, dressed in a beautiful silver dress. The background is a dark, creepy eastern europen forrest. night, horroristic shadows, high contrasts, lumnious, photorealistic, dreamlike, (mist filters), theatrical, character concept art by ruan jia, thomas kinkade, and J.Dickenson, trending on Artstation
+elephant yoda playin socker, stunning digital art, high detail, in the style of artgerm, artstation, cgsociety, dramatic lighting, pixar 3d 8k
+photo of nikolas cage as ken from street fighter 2, shoulder length hair, high - contrast, intricate, action pose, highly detailed, centered, digital painting, artstation, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+a portrait of frodo baggins, fantasy, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, alphonse mucha, and greg rutkowski
+a very beautiful anime elf girl, full body, long silver hair with a flower, sky blue eyes, full round face, short smile, revealing clothes, thick thigs, firm chest, ice snowy lake setting, cinematic lightning, medium shot, mid-shot, highly detailed, trending on Artstation, Unreal Engine 4k, cinematic wallpaper by Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti, and Sakimichan
+a League of Legends FAN ART Portrait of VI, pink hair, short hair, elegant, highly detailed, digital painting, concept art, smooth, sharp focus, illustration, by Laurie Greasley,Lawrence Alma-Tadema,Dan Mumford,artstation,deviantart,Unreal Engine,face enhance,8K,golden ratio,cinematic lighting
+!dream concept art, four glam rockers dressd as a mix of hooligans and whores, walking down a dark wet london alley at night, by ashley wood, by roger deakins, atmospheric
+art portrait of death, 8 k, by tristan eaton, stanley artgermm, tom bagshaw, greg rutkowski, carne griffiths, trending on deviantart, face enhance, hyper detailed, minimalist cinematic lighting, trending on artstation, 4 k, hyperrealistic, focused, extreme details, unreal engine 5, cinematic, masterpiece, full of colour,
+A beautiful robotic woman dreaming, cinematic lighting, soft bokeh, sci-fi, modern, colourful, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, by greg rutkowski
+highly detailed vfx portrait of ichigo kurosaki from bleach by tite kubo!!!, stephen bliss, greg rutkowski, loish, rhads, beeple, makoto shinkai, tom bagshaw, alphonse mucha, sharp focus, art by artgerm and greg rutkowski, stanley kubrick, backlit!!,
+fantasy city at night while giant ball of fire crashes to the ground, surreal, digital art, concept art, highly detailed, trending on artstation
+concept art of a lightray trapped in vacuum, high definition, symmetrical, insanely detailed, elegant, intricate, hypermaximalist, cgsociety, prizewinning, trending on artstation, popular, top 1 0 0, best, winner, mentor, guru
+a dream microphone in a dystopic world full of aberration, black & white, melting, webbing, 8 k, by tristan eaton, stanley artgerm, tom bagshaw, greg rutkowski, carne griffiths, ayami kojima, beksinski, giger, trending on deviantart, face enhance, hyper detailed, minimalist, horror, alien
+link from zelda using computer, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+lofi steampunk portrait pixar style by (((Lita Cabellut))) and Stanley Artgerm and Tom Bagshaw
+fractional reserve banking, watercolor, trending on artstation
+a painting of a tank getting shot at in world war 2 by Bernardo Bellotto, high detail, hyperrealistic, concept art, artstation, 8k
+a beautiful diva sings on the theater stage , octane render, cgsociety, artstation trending, palatial scene, highly detailded
+Ghibli, good day, landscape, no people, no man, fantasy, wood, vibrant world, Anime Background, concept art, illustration,smooth, sharp focus, intricate, super wide angle, trending on artstation, trending on deviantart, Hayao Miyazaki, 4K
+sapphire viking warrior, regal, elegant, winter, snow, beautiful, stunning, hd, illustration, epic, d & d, fantasy, intricate, elegant, highly detailed, wide angle, digital painting, artstation, concept art, smooth, sharp focus, illustration, wallpaper, art by artgerm and greg rutkowski and alphonse mucha and jin xiaodi
+fullbody!! dynamic action pose, beautiful woman with blue hair, antlers on her head, long flowing intricate black dress, dnd, face, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+A masterpiece ultrarealistic ultradetailed portrait of a Incredibly beautiful llama with dreadlocks IN INCREDIBLE GLASSES. baroque renaissance. in the forest. White corset. medium shot, intricate, elegant, highly detailed. trending on artstation, digital art, by Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti, Yoshitaka Amano. background by James Jean and Gustav Klimt, light by Julie Bell, 4k, porcelain skin. BY ZDIZISLAW BEKSINSKI Cinematic concept art
+young harry potter as a gepard with gepard skin patterns hyper detailed, digital art, trending on artstation, cinematic lighting
+a dik dik monster with tattoos, wearing a fedora, tattoos, colorful, digital art, fantasy, magic, trending on artstation, ultra detailed, professional illustration by basil gogos
+a astronaut walking on a alien planet with alien plants and looking to a alien breathtaking landscape, cinematic lighting, concept art, trending on Artstation, trending on DeviantArt, highly detailed, high quality, 8K HDR, octane render, unreal engine 5, breathtaking landscape, highly detailed, high quality, post processed
+skeleton geisha in a burdel, Tending on artstation, concept art, dark colors, 8k
+realistic attractive grungy woman with rainbow hair, drunk, angry, soft eyes and narrow chin, dainty figure, long hair straight down, torn overalls, basic white background, side boob, tattooed, pierced, flirty, wet shirt, wet, raining, highly detailed face, realistic face, beautiful detailed eyes, fantasy art, in the style of greg rutkowski, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, vibrant,
+colossal orange viking royal king tabby cat, golden hour, fantasy, vivid colors, sharp focus, digital art, hyper - realistic, 4 k, unreal engine, highly detailed, hd, dramatic lighting by brom, trending on artstation
+pencial drawing concept art of a machine mutant martial artist in the style of akira toriyama / hirohiko araki / tite kubo / masashi kishimoto trending on artstation deviantart pinterest detailed realistic hd 8 k high resolution
+goth rainbow bright, fantasy, d & d, intricate, detailed, by by alphonse mucha, adolfo hohenstein, alice russell glenny, stanley artgerm lau, greg rutkowski, detailed, trending on artstation, trending on artstation, smooth
+symmetry!! the eternal struggle of good and evil, very detailed, perfect lighting, perfect composition, 4 k, artstation, artgerm, derek zabrocki, greg rutkowski
+portrait of beautiful cute young goth girl with glasses, cyberpunk, high details, neon, art by ( ( ( kuvshinov ilya ) ) ) and wayne barlowe and gustav klimt and artgerm and wlop and william - adolphe bouguereau
+young Erin Gray as a ruggedly beautiful retro SCI-FI space heroine 1985 , intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and donato giancola and Joseph Christian Leyendecker, Ross Tran, WLOP
+a large colorful candy cane is sticking out the ground on the side of a serene foot path. there are some snow drifts laying against the candy. there are snow flurries in the air. epic, awe inspiring, dramatic lighting, cinematic, extremely high detail, photorealistic, cinematic lighting, trending on artstation cgsociety rendered in unreal engine, 4 k, hq,
+a hyper - detailed 3 d render like a oil painting of the construction of a upward spiral, surrealism!!!!! surreal concept art, lifelike, photorealistic, digital painting, aesthetic, smooth, sharp focus, artstation hd, by greg rutkowski, bruce pennington, valentina remenar and asher duran,
+a octane render of a violent tornado inside a jar, close - up studio photo, studio lighting, path traced, highly detailed, high quality, hyperrealistic, concept art, digital art, trending on artstation, cinematic, high coherence, epic scene, 8 k hdr, high contrast
+portrait of a young, ruggedly handsome ranger, muscular, half body, leather, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+the avengers fighting thanos, long shadow, warm colors, by Greg Rutkowski, artstation
+a person looking like vladimir putin riding giant steel krab, masterpiece, intricate, elegant futuristic wardrobe, highly detailed, digital painting, artstation, concept art, crepuscular rays, smooth, sharp focus, illustration, background galaxy, cyberpunk colors, volumetric lighting, art by artgerm and james jean and nick sullo
+book cover!!!!!!!!!!!!, old bridge, ivy vector elements at each border, fantasy forest landscape, fantasy magic, light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by wlop and artgerm and ivan shishkin and andrey shishkin, masterpiece
+a modernist courtroom in the rainforest by raphael, hopper, and rene magritte. detailed, proportional, romantic, vibrant, enchanting, achingly beautiful, graphic print, trending on artstation, jungle, tropical, foliage, flowering, blooming
+portrait of a beautiful mysterious woman holding a bouquet of flowing flowers, hair flowing upwards, small bubbles from her mouth, hands hidden under the bouquet, submerged underwater filled with colorful small fish and coral reef, fantasy, regal, intricate, by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman rockwell
+portrait of a friendly charming formal barbarian giant noble!, imperial royal elegant clothing, elegant, rule of thirds, extremely detailed, artstation, concept art, matte, sharp focus, art by greg rutkowski, cover by artgerm
+goldfinger, character sheet, concept design, contrast, kim jung gi, greg rutkowski, zabrocki, karlkka, jayison devadas, trending on artstation, 8 k, ultra wide angle, pincushion lens effect
+a watercolor ink painting of scooby - doo as the primordial eldritch god of natural - disasters in the style of jean giraud in the style of moebius trending on artstation deviantart pinterest detailed realistic hd 8 k high resolution
+zidane and shrek wearing vr playing gta v, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by greg rutkowski and alphonse mucha
+full body portrait of marvel cinematic universe aaliyah haughton, she venom, spider man, elegant, webs, super hero, spider web background, highly detailed!! digital painting, artstation, glamor pose, concept art, sharp focus, illustration, art by artgerm and greg rutkowski, artey freytag
+demonic evil cute fourteen year old brown skinned asian girl, tomboy, evil smile, freckles!!!, fully clothed, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha,
+renfri, the princess turned gang leader from the witcher universe. fantasy art by greg rutkowski, gustave courbet, rosa bonheur, edward hopper. faithfully depicted facial expression, perfect anatomy, sharp focus, global illumination, radiant light, detailed and intricate environment, trending on artstation
+A chef with a big mustache proundly making a soup, digital painting, artstation, concept art, Craig Mullins, Breathtaking, 8k resolution, extremely detailed, beautiful, establishing shot, artistic, hyperrealistic, octane render, cinematic lighting, dramatic lighting, masterpiece, light brazen, extremely detailed and beautiful face
+a space realistic robot with big and cute eyes, | | very anime, fine - face, realistic shaded robotic parts, fine details. anime. realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo ghost - in - the - shell, magali villeneuve, artgerm, jeremy lipkin and michael garmash, rob rey and kentaro miura style, trending on art station
+Portrait of The Most Beautiful old Woman On Earth , D&D, fantasy, intricate, richly detailed colored 3D illustration of a beautiful ornated cute body with long metallic hair wearing a hoodie and short skirt that is happy and curious smile. background with completely rendered reflections, art by Range Murata and Artgerm highly detailed, digital painting, trending on artstation, sharp focus, illustration, style of Stanley Artgerm, perfect smile and sexy mouth,
+lucifer cast out of heaven by yusuke murata and makoto shinkai, clouds, fire, angels, 8k, cel shaded, unreal engine, featured on artstation, pixiv
+A pirate ship in the middle of the sea during a storm, fantasy art, in the style of greg rutkowski, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing
+fox as a monkey, fluffy white fur, black ears, stunning green eyes, extremely long white tail with black tip, award winning creature portrait photography, extremely detailed, artstation, 8 k, sensual lighting, incredible art, wlop, artgerm
+autumn in french village, ornate, beautiful, atmosphere, vibe, mist, smoke, fire, chimney, rain, wet, pristine, puddles, melting, dripping, snow, creek, lush, ice, bridge, green, stained glass, forest, roses, flowers, by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman rockwell
+Keanu Reeves as spiderman , film still, muscle extremely detailed, fantastic details full face, mouth, trending on artstation, pixiv, cgsociety, hyperdetailed Unreal Engine 4k 8k ultra HD, WLOP
+symmetry!! portrait of elon musk with a salvador dali moustache intricate, neon lights, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski
+Movie still of danny devito as as Harry Potter in potions class at hogwarts, fantasy, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and Tony Sart
+a planet that resembles a skull, stars in the background, natural, ultra detail. digital painting, beautiful, concept art, ethereal, cinematic, epic, 8k, highly detail, insane detailed, oil painting, octane render, cinematic lighting, smooth, sharp, Artstation, mystical, illustration, Trending on Artstation, Artstation HQ, Artstation HD, digital art,
+anthropomorphic art of a businessman dragon, green dragon, dragon head, portrait, victorian inspired clothing by artgerm, victo ngai, ryohei hase, artstation. fractal papers and books. highly detailed digital painting, smooth, global illumination, fantasy art by greg rutkowsky, karl spitzweg
+Billie Eilish, sitting in a cafe, fantasy, intricate, elegant, highly detailed, digital painting, pale skin, artstation, concept art, matte, sharp focus, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+a tornado made of fire on a field, au naturel, hyper detailed, digital art, trending in artstation, cinematic lighting, studio quality, smooth render, unreal engine 5 rendered, octane rendered, art style by klimt and nixeu and ian sprigger and wlop and krenz cushart
+a queue of grey people looking like perfect copies of each other, in the style of artgerm, gerald brom, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation
+vampire in the style of stefan kostic, realistic, full body shot, wide angle, sharp focus, 8 k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm, floating embers
+fluffy cat in cowboy hat like a tiny girl riding on the back of a giant corgi, by greg rutkowski
+beautiful black woman elf wearing a dark green robe portrait, art nouveau, fantasy, intricate arcane wiccan designs, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by Artgerm and Greg Rutkowski and WLOP
+portrait of a wizard, intricate, highly detailed, digital painting, artstation, concept art, sharp focus, art by huifeng huang and greg rutkowski
+daredevil portrait, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and alphonse mucha
+a highly detailed illustration of tall beautiful red haired lady wearing black spaghetti strap noir style dress and sun hat, elegant stroking face pose, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, league of legends concept art, wlop.
+a hooded wise old man with a long white beard wearing a brown hooded tunic riding on top of a lion, the man riding is on the lion, the wise man is riding on top, he is all alone, majestic, epic digital art, cinematic, trending on artstation, superb detail 8 k, wide angle shot, masterpiece
+a lisa frank mcdonalds microwaved happymeal, gothic, highly detailed, digital painting, artstation, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+a beautiful highly detailed matte painting of a building looking like a goose by Jose Daniel Cabrera Pena and Leonid Kozienko, concept art by Tooth Wu and wlop and beeple and dan mumford and greg rutkowski and nekroxiii. octane render, cinematic, hyper realism, octane render, 8k, iridescent accents. vibrant, teal and gold blue red dark noir colour scheme
+a statue made of red marble, of an beautiful girl, full body shot, perfect body, red white biomechanical, inflateble shapes, wearing epic bionic cyborg implants, masterpiece, intricate, biopunk futuristic wardrobe, vogue, highly detailed, artstation, concept art, background galaxy, cyberpunk, octane render
+a ultradetailed beautiful panting of scarlett johansson as motoko kusanagi, by conrad roset, greg rutkowski and makoto shinkai, trending on artstation
+a detailed portrait of a weasel assassin dressed with a leather armor, by justin gerard and greg rutkowski, digital art, realistic painting, dnd, character design, trending on artstation
+a soldier zombie with a gas mask, pile of skulls, horror, black and white, fantasy art, monster art, illustration, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing
+yvonne strahovski, very sexy penguin outfit, medium shot, visible face, detailed face, perfectly shaded, atmospheric lighting, by makoto shinkai, stanley artgerm lau, wlop, rossdraws
+concept art of love, death + robots series of netflix, cinematic shot, oil painting by jama jurabaev, brush hard, artstation, for aaa game, high quality, brush stroke
+Ottoman Emperor George Washington, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, Ottoman armor, artstation, illustration, concept art, smooth, sharp focus, art by John Collier and Albert Aublet and Krenz Cushart and Artem Demura and Alphonse Mucha
+Full potrait of cinead o'connor as an angel, hyper realistic, prismatic highlights, atmosphere, gorgeous, depth of field, cinematic, macro, concept art, 50mm, artstation, wlop, elegant, epic, weta digital, focus, octane render, v-ray, 8k, kodak portra, art by Liberatore
+a film still of of a woman explorer, ( emerald herald ), exploring lost ruins, sun lighting, water, finely detailed features, perfect art, at an ancient city, gapmoe yandere grimdark, trending on pixiv fanbox, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli,, akihiko yoshida
+a very beautiful young yuuki asuna, full body, long wavy blond hair, sky blue eyes, full round face,, bikini, miniskirt, front view, mid - shot, highly detailed, cinematic wallpaper by stanley artgerm lau
+symmetry!! portrait of jair bolsonaro, sci - fi, tech wear, glowing lights!! intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+an art and technology t - shirt, digital art from artstation, by ruan jia and mandy jurgens and artgerm and william - adolphe bouguereau fantasy, epic digital art, volumetrc lighting, clean detail 8 k resolution
+gamora, portrait, digital painting, elegant, beautiful, highly detailed, artstation, concept art
+a mechanized version of a norse woman, facial piercings, very symmetrical, furry warrior's bone clothing, highly detailed, by vitaly bulgarov, joss nizzi, ben procter, steve jung, concept art, concept art world, pinterest, artstation, unreal engine
+photorealistic dwayne johnson but he is made of rocks. hyperdetailed photorealism, 1 0 8 megapixels, amazing depth, glowing rich colors, powerful imagery, 3 d finalrender, 3 d shading, cinematic lighting, artstation concept art
+anime young boy with short wavy white hair wearing white clothes with short cape surrounded by light orbs, moody, wlop, concept art, digital painting, trending on artstation, highly detailed, epic composition, 8 k uhd
+a photo of 8 k ultra realistic humanoid princess standing next to a beautiful view, ornate white and gold officers outfit, cinematic lighting, trending on artstation, 4 k, hyperrealistic, focused, extreme details, unreal engine 5, cinematic, masterpiece
+epic 3 d abstract model, liquid headdress, 2 0 mm, with pastel pink and cerulean peanut butter, melting smoothly into other faces, liquid, delicate, beautiful, intricate, houdini sidefx, trending on artstation, by jeremy mann and ilya kuvshinov, jamie hewlett and ayami kojima
+Idris Elba as Superman (2019), zac snyder, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+kneeling before a condescending queen, royal gown, golden detailing, medium shot, intricate, elegant, highly detailed, digital painting, volumetric light, artstation, concept art, smooth, sharp focus, illustration, art by Gil Elvgren and Greg Rutkowski and Alphonse Mucha, 8K
+a highly detailed and high technology alien spacecraft, centered, corals, plume made of geometry, water texture, wet, wet lighting, extremly detailed digital painting, sharp focus in the style of android jones, artwork of a futuristic artificial intelligence superstar with frames made of detailed circuits, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, under water visual distortion, dark tones colors, trending on artstation
+metalhead, by yoshitaka amano, ruan jia, kentaro miura, artgerm, detailed, intricate details, trending on artstation, hd, masterpiece
+futuristic utopian city, central hub, white buildings, golden sunset, space ships, green trees, large flying drones, utopia, high quality, hopeful, beautiful design, scifi, high detail, global illumination, trending on artstation, art by richard dumont, leon tukker
+Fae teenage girl, portrait, face, long red hair, green highlights, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+charizard flying above new york, highly detailed matte fantasy painting, stormy lighting, by ross tran, by artgerm, by lisa frank, by brom, by peter mohrbacher
+Glowing glass jar with a pink tentacle in green liquid, macro, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, volumetric lighting, cinematic, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+jazz music, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, mars ravelo and greg rutkowski
+portrait of a rugged ranger, muscular, upper body, hairy torso, detailed detailed detailed hands hands hands hands, D&D, fantasy, bare bare bare bare thighs thighs thighs intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm
+an extremely psychedelic portrait of hunter s. thompson, surreal, lsd, face, detailed, intricate, elegant, lithe, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration
+greg manchess portrait painting of snufkin as overwatch character, medium shot, asymmetrical, profile picture, organic painting, nebula, matte painting, bold shapes, hard edges, street art, trending on artstation, by huang guangjian and gil elvgren and sachin teng
+portrait of abandoned ribbed sculpture of two kissing cyborgs, covered with tentacles, roots, wires, tubes, ash, mold, baroque painting, standing in a desolate empty wasteland, creepy, nightmare, dream-like heavy atmosphere, dark fog, surreal abandoned buildings, baroque painting, beautiful detailed intricate insanely detailed octane render trending on Artstation, 8K artistic photography, photorealistic, volumetric cinematic light, chiaroscuro, zoomed out, Raphael, Caravaggio, Beksinski, Giger
+underwater naga portrait, Pixar style, by Tristan Eaton Stanley Artgerm and Tom Bagshaw.
+priestess with angelical wings, golden hair, fluorescent eyes, white skin, lipstick, beautiful, goodness, high fantasy, illustration, by artgerm, greg rutkowski, alphonse mucha
+a warlock is casting a magic spell, with magic orb floating in his hand , dynamic pose, natural lighting, medium level shot, Mucha style , Grim fantasy, illustration ,concept art,
+portrait of the secretive vampire woman biker loner smiling at her cat, by yoshitaka amano, casey baugh, steve caldwell, gottfried helnwein, yasunari ikenaga, nico tanigawa, and artgerm rendered with 3 d effect.
+aliens in Jerusalem, concept art, hd
+many Alchemy Imperial legends knights super hero boys girl, sci-fi, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, fractal flame, amazing composition unreal engine
+concept art, of buffaloes on a beach, sunset, 30mm, canon, very hot, crowded, artstation
+portrait of john candy crying, john candy suffering, metaverse on fire, octane render, trending on artstation
+a portrait of a cyborg josip broz tito. vaporwave, intricate, epic lighting, cinematic composition, hyper realistic, 8 k resolution, unreal engine 5, by artgerm, tooth wu, dan mumford, beeple, wlop, rossdraws, james jean, marc simonetti, artstation
+anime portrait of the priestess in the forest, enchanted, magic, digital, concept art, Kyoto animation,last exile, blue submarine no. 6, katsura masakazu,tsutomu nihei, gustav klimt,loish, murata range, kawaii, studio lighting, manga, bright colors, anime,beautiful, 35mm lens,noir, vibrant high contrast, gradation, jean giraud, moebius, fantasy, rule of thirds, unreal engine, fibonacci, intricate, cel shaded, blender npr, flat, matte print, smooth, Ilya Kuvshinov, Tsuruta Kenji
+a large 1 8 th century pirate airship flying among the clouds, soaring through the sky, airship, digital art, pirate ship, vivid colors, artgerm, james gilleard, beautiful, highly detailed, intricate, trending on art station
+Taylor Swift Cosplaying Lola Bunny, modeling, posing, two piece workout clothes, training bra, quality lighting, vibrant colors, maximalism, facial details, photograph of Taylor Swift, Tooth Wu Artgerm WLOP artstation deviantart, 8k, fanart, playboy style, very very aesthetic
+a cartoon pineapple holding a large glass of port, nightclub, elegant, real life skin, intricate, high detailed, artstation, concept art, smooth, sharp focus, art by artgerm and greg rutkowski
+powerful goddess of water clothed in swirling water striding through a stormy sea, dress made of water, highly detailed matte fantasy painting, rendered in octane, stormy lighting, by ross tran, by artgerm, by david suh, by peter mohrbacher
+an extremely detailed matte painting emma watson as borg nine star trek, digital painting, beautiful eyes!, pretty face!!, symmetry, concept art, sharp focus, illustration, art by artgerm! greg rutkowski magali villeneuve wlop! ilya kuvshinov!!, octane render
+bruce campbell as harry potter in “ harry potter and the philosopher's stone ” ( 2 0 0 1 ). movie still detailed, smooth, sharp focus.
+a beautiful portrait of a tree goddess by Greg Rutkowski and Raymond Swanland, Trending on Artstation, ultra realistic digital art
+a beautiful masterpiece painting of the last poet whispering,'if all can begin again, then everything must continue!'by juan gimenez, long shiny black hair blue eyes, award winning, trending on artstation, photorealistic, hyperrealism, octane render, unreal engine
+cabin high on a mountain, the valley beneath, dynamic lighting, photorealistic fantasy concept art, trending on art station, stunning visuals, creative, cinematic, ultra detailed
+a painting so beautiful and universally loved it creates peace on earth, profound epiphany, trending on artstation, by john singer sargent
+Majestic powerfull red white Winged Hussars cavalry horde charging at ugly rainbow demons and trolls on ground, huge golden cross above them on the sky, white red eagle helping hussars, blood, snow, wide angle, professional kodak lenses, magic, fire, face painting, dramatic lighting, intricate, wild, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, footage from space camera
+bob ross!! riding a dinosaur, giant paintbrush in hand, model pose, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+Concept art of a Toilet-Plunger designed by Apple Inc
+colorful medieval botanical garden, ornate, beautiful, atmosphere, vibe, mist, smoke, chimney, rain, well, wet, pristine, puddles, waterfall, melting, dripping, snow, ducks, creek, lush, ice, bridge, cart, forest, flowers, concept art illustration, color page, 4 k, tone mapping, akihiko yoshida, james jean, andrei riabovitchev, marc simonetti, yoshitaka amano, digital illustration, greg rutowski, volumetric lighting, sunbeams, particles, trending on artstation
+a synthwave cuber bokeh brain, tristan eaton, victo ngai, artgerm, rhads, ross draws
+portrait of korean beautiful female necromancer, face, dark fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Intergalactic plant store floating in space with white twinkling stars in the foreground, galactic terrarium filled with plants from alien planets floating in the cosmos, Filled with plants, warm ethereal glowing ambiance, concept art 8k resolution
+portrait of a friendly charming formal barbarian giant noble!, imperial royal elegant clothing, elegant, rule of thirds, extremely detailed, artstation, concept art, matte, sharp focus, art by greg rutkowski, cover by artgerm
+artgerm, joshua middleton comic cover art, full body pretty even rachel wood faye, symmetrical eyes, symmetrical face, long curly black hair, beautiful forest, cinematic lighting
+a forgotten garden gnome in a vast barren desert, hopeless wasteland background with a relentless raging sun overhead, an ultrafine detailed painting by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, trending on deviantart, pop surrealism, whimsical, lowbrow, perfect symmetrical face, grotesque
+Portrait of a tall beautiful brown-skin elf woman wearing stylish black and gold robes, warm smile, intricate, elegant, highly detailed, digital painting, smooth, sharp focus, artstation, graphic novel, art by stanley artgerm and greg rutkowski and peter mohrbacher,
+a giant broken robots in rain after a huge battle, tired, rustic, dormant, sharp focus, james gilleard, cinematic, game art, extremely detailed digital painting, print
+biolevel 4 secret lab, alien autopsy, wide angle, super highly detailed, professional digital painting, artstation, concept art, smooth, sharp focus, no blur, no dof, extreme illustration, unreal engine 5, photorealism, hd quality, 8 k resolution, cinema 4 d, 3 d, beautiful, cinematic, art by artgerm and greg rutkowski and alphonse mucha and loish and wlop
+Gertrude Abercrombie, minimalistic graffiti masterpiece, minimalism, 3d abstract render overlayed, black background, psychedelic therapy, trending on ArtStation, ink splatters, pen lines, incredible detail, creative, positive energy, happy, unique, negative space, face, artgerm
+a dramatic, epic, ethereal painting of a !!handsome!! thicc chunky beefy mischievous shirtless man with a big beer belly wearing a large belt and cowboy hat offering a whiskey bottle | he is relaxing by a campfire | background is a late night with food and jugs of whisky | homoerotic | stars, tarot card, art deco, art nouveau, intricate | by Mark Maggiori (((and Alphonse Mucha))) | trending on artstation
+: sphere sculpture covered with maze pattern,hyper detailed art station parabolic lighting contest winners unrealengine trending on artstation,cinematic, hyper realism, high detail, octane render, 8k
+starfinder lashunta pilot, wearing a flight suit, in a space port, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by artgerm and greg rutkowski
+concept art of a futuristic gold warrior, large gold apendages on it's back, with a black obsidian helmet, tight armor, rough and jagged design | | epic - fine - fine details by stanley artgerm lau, wlop, rossdraws, and sakimichan, trending on artstation, brush strokes
+a study of cell shaded cartoon of a monk on a skateboard with technical analysis charts in the background, illustration, wide shot, subtle colors, post grunge, concept art by josan gonzales and wlop, by james jean, Victo ngai, David Rubín, Mike Mignola, Laurie Greasley, highly detailed, sharp focus, alien, Trending on Artstation, HQ, deviantart, art by artgem
+potato house interior design, Greg Rutkowski, trending on Artstation, 8K, ultra wide angle, pincushion lens effect.
+a angry knight in full plate of black armor, splattered with blood, riding a large black war horse, with red glowing eyes flowing red mane and tail, blackened clouds cover sky, crackling with lightning, a castle in distance burns, concept art by greg rutkowski, craig mullins, todd mcfarlane,
+sexy painting of 3 5 0 - pound taylor swift, red bikini, navel piercing, ultra realistic, sharp details, subsurface scattering, intricate details, warm lighting, beautiful features, highly detailed, photorealistic, octane render, 8 k, unreal engine, art by artgerm and greg rutkowski and alphonse mucha
+the grand canyon filled with glowing futuristic cyberpunk skyscrapers at night with a starry sky, cinematic, wide angle establishing shot, fantasy, hyperrealism, greg rutkowski, tuomas korpi, volumetric light, octane render, photorealistic concept art, highly detailed, very intricate
+Dwight Shrute as blue man. digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and donato giancola and Joseph Christian Leyendecker, Ross Tran, WLOP
+death, dark fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, wallpaper, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+a realistic illustration portrait of a beautiful cute girl with wavy black red hair, a pointy nose and, round chin black eyeliner, green pupills, trending on artstation, hyper - realistic lighting, intricate by imagineartforyou
+looking out to see a long wood dock on the water, child at end of dock, big fishing boat leaving the dock with sailors waving, low angle, long lens, sunset, a mediterranean phoenician fishing village in the distance, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and raphael lacoste and magali villeneuve
+splash art for new champion for league of legend, by riot games. trending on artstation
+Anime as Elizabeth Olsen playing Scarlet Witch || cute-fine-face, pretty face, realistic shaded Perfect face, fine details. Anime. realistic shaded lighting poster by Ilya Kuvshinov katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash and Rob Rey as Scarlet Witch in New York cute smile
+a portrait of apocalypse from x - men, fantasy, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, alphonse mucha, and greg rutkowski
+portrait of radical lolita girl, dreamy and ethereal and dark, dark eyes, smiling expression, ornate goth dress, dark fantasy, chaotic, elegant, black crows flying, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+photography of man playing realistic virtual reality game in a giant mine with excavators and gnomes 3 d realistic model render in the style of zaha hadid with point cloud in the middle, in cyberpunk 2 0 7 7 colors, unreal engine 5, keyshot, octane, artstation trending, ultra high detail, ultra realistic, cinematic, 8 k, 1 6 k, in style of zaha hadid, in style of nanospace michael menzelincev, in style of lee souder, in plastic, dark atmosphere, tilt shift, depth of field
+greg manchess portrait painting of armored starlord as overwatch character, medium shot, asymmetrical, profile picture, organic painting, sunny day, matte painting, bold shapes, hard edges, street art, trending on artstation, by huang guangjian and gil elvgren and sachin teng
+full lenght shot, super hero pose, biomechanical dress, inflateble shapes, wearing epic bionic cyborg implants, masterpiece, intricate, biopunk futuristic wardrobe, highly detailed, art by akira, mike mignola, artstation, concept art, background galaxy, cyberpunk, octane render
+alterd carbon, masked angel protecting girl and a woman, vampre the masquerade, neon, detailed intricate render, dark atmosphere, detailed illustration, hd, 4 k, digital art, overdetailed art, surrealistic, by greg rutkowski, by loish, complementing colors, trending on artstation, deviantart
+fullbody portrait of a beautiful girl dressed in cyberpunk style, standing on street, holding a sniper rifle. by riot games, anime style, masterpiece, award - winning, trending on artstation and pixiv
+thief red riding hood, d & d, fantasy, portrait, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+anthropomorphic highly detailed group portrait of funny neon giant cute eyes dust mephit, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm, bob eggleton, michael whelan, stephen hickman, richard corben, wayne barlowe, trending on artstation and greg rutkowski and alphonse mucha, 8 k
+beautiful girl galaxy background, portrait character concept style trending on artstation concept art detailed octane render cinematic photo - realistic 8 k high detailed
+realistic render of flying blue whales towards the moon, intricate, toy, sci - fi, extremely detailed, digital painting, sculpted in zbrush, artstation, concept art, smooth, sharp focus, illustration, chiaroscuro lighting, golden ratio, incredible art by artgerm and greg rutkowski and alphonse mucha and simon stalenhag
+a highly detailed epic cinematic concept art CG render digital painting artwork: Steampunk Wizard stands and looks at the Tower of Babel in the distance. By Greg Rutkowski, in the style of Francis Bacon and Syd Mead and Norman Rockwell and Beksinski, open ceiling, highly detailed, painted by Francis Bacon and Edward Hopper, painted by James Gilleard, surrealism, airbrush, Ilya Kuvshinov, WLOP, Stanley Artgerm, very coherent, triadic color scheme, art by Takato Yamamoto and James Jean
+devastated scorched earth in the valley, burnt trees, burnt vegetation and grass, cinematic view, epic sky, detailed, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by jordan grimmer, huge scene, grass, art greg rutkowski
+portrait Ninja gaiden girl, armored black and red ninja wardrobe, in ruin japanese rainny temple night, ssci-fi and fantasy, intricate and very very beautiful and elegant, highly detailed, digital painting, artstation, concept art, smooth and sharp focus, illustration, art by tian zi and WLOP and alphonse mucha
+girl jumping near a lake, rainy, touching a long neck monster, illustration concept art anime key visual trending pixiv fanbox by wlop and greg rutkowski and makoto shinkai and studio ghibli
+beautiful digital painting of a hoyeon jung stylish female snow - covered mountains with high detail, real life skin, freckles, 8 k, stunning detail, works by artgerm, greg rutkowski and alphonse mucha, unreal engine 5, 4 k uhd
+a potrait of a human rogue, fine details. night setting. realistic shaded lighting poster by ilya kuvshinov katsuhiro, artgerm, jeremy lipkin and michael garmash, unreal engine, radiant light, detailed and intricate environment, digital art, trending on art station
+portrait full body girl 3 kingdom breathtaking detailed concept art painting art deco pattern of birds goddesses amalmation flowers head thibetan temple, by hsiao ron cheng, tetsuya ichida, bizarre compositions, tsutomu nihei, exquisite detail, extremely moody lighting, 8 k, art nouveau, old chines painting, art nouveau
+greg manchess portrait painting of armored sanguinius with huge wings as overwatch character, medium shot, asymmetrical, profile picture, organic painting, sunny day, matte painting, bold shapes, hard edges, street art, trending on artstation, by huang guangjian and gil elvgren and sachin teng
+rendering of old hands reaching forward, concept art, high detail, intimidating, cinematic, Artstation trending, octane render
+dynamic photography portrait of a dungeons and dragons king's colosse , intricate ornate armor, subject in the middle of the frame, rule of thirds, golden ratio, elegant, digital painting, octane 4k render, zbrush, hyperrealistic, artstation, concept art, smooth, sharp focus, illustration from Warcraft by Ruan Jia and Mandy Jurgens and Artgerm and William-Adolphe Bouguerea
+anthropomorphic triangle brain in edgy darkiron badger demon, intricate, elegant, highly detailed animal monster, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm, dwayne barlowe, trending on artstation and greg rutkowski and alphonse mucha, 8 k
+christiano ronaldo, manga cover art, detailed color portrait, artstation trending, 8 k, greg rutkowski
+a faceless!!!!! woman posing for the camera, charcoal painting!!!!! illustrated by kathe kollwitz, trending on artstation, 4 k, 8 k, artstation hd, artstation hq, artistic interpretation, 1 9 5 0 s style
+a potrait of a female necromancer with big and cute eyes, fine - face, realistic shaded perfect face, fine details. night setting. very anime style. realistic shaded lighting poster by ilya kuvshinov katsuhiro, magali villeneuve, artgerm, jeremy lipkin and michael garmash, rob rey and kentaro miura style, trending on art station
+fractal tarot card of a naturepunk retrofuture nexus of technology and earth, beautiful detailed realistic cinematic character high concept fashion portrait, hi - fructose art magazine, by anton fadeev and paul lehr and david heskin and josan gonzalez, 8 k
+realistic high key portrait rendering of a beautiful curvy pale alabaster goth girl with asymmetrical punk rock hair and badass euro design sunglasses. mole on cheek. half portrait by stanley artgerm, dramatic lighting, by tohuvabohu, nagel, shin jeongho, nick silva and ilya kuvshinov, deviantart, detailed character design, 8 k resolution
+the world serpent ultra detailed fantasy, elden ring, realistic, dnd character portrait, full body, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, global illumination radiating a glowing aura global illumination ray tracing hdr render in unreal engine 5
+helmet of a forgotten deity with a labyrinth, in the style of tomasz alen kopera and fenghua zhong and peter mohrbacher, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, trending on artstation
+duotone psychedelic concept illustration 3 / 4 portrait of dr. albert hofmannn taking bicycle trip fractals background. cinematic scene. vlumetric lighting. golden rario accidental renaissance. by sachin teng and sergey kolesov and ruan jia and heng z. graffiti art, scifi, fantasy, hyper detailed. octane render. concept art. trending on artstation
+A very tall, slender woman wearing black puffy clothes and holding a yellow umbrella, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by artgerm, Alphonse mucha, and Greg Rutkowski
+vibrant! colorful!!! the last supper of simpsons by rene magritte, futurama by laurie greasley and bouguereau, ( ( etching by gustave dore ) ), ultraclear intricate, sharp focus, highly detailed digital painting illustration, concept art, masterpiece
+award winning brandmark for a research lab, mind wandering, hip corporate, no text, trendy, vector art, concept art
+an epic non - binary model, subject made of white mesh rope, with cerulean and pastel pink bubbles bursting out, delicate, beautiful, intricate, melting into a wolf, houdini sidefx, by jeremy mann and ilya kuvshinov, jamie hewlett and ayami kojima, trending on artstation, bold 3 d
+character concept of iridescent sinewy smooth muscular male sleek glossy indigo black pearlescent scifi armor with smooth black onyx featureless helmet, by greg rutkowski, mark brookes, jim burns, tom bagshaw, magali villeneuve, trending on artstation
+phil noto, peter mohrbacher, thomas kinkade, artgerm, 1 9 5 0 s rockabilly anya taylor - joy catwoman dc comics, pompadour, long hair, vines, symmetrical eyes, city rooftop
+dark high detailed space station interior a statue jesus on cross made of white marble, perfect symmetrical body, full body shot, inflateble shapes, wires, tubes, veins, jellyfish, white biomechanical details, wearing epic bionic cyborg implants, masterpiece, intricate, biopunk, vogue, highly detailed, artstation, concept art, cyberpunk, octane render
+Portrait of a man by Greg Rutkowski, a young, strong and hard-eyed futuristic warrior with brown hair with dreadlocks, wearing a futuristic space tactical gear that looks like a mix between the samurai, viking and templar aesthetics, mix between tribal and hi-tech, highly detailed portrait, scifi, space opera, digital painting, artstation, concept art, smooth, sharp foccus ilustration, Artstation HQ
+epic professional digital art of a snail in a blue professional business suit, sitting at a desk, best on artstation, cgsociety, wlop, Behance, pixiv, astonishing, impressive, outstanding, epic, cinematic, stunning, gorgeous, much detail, much wow, masterpiece
+hyper realistic oil painting of frozen little island planet with waterfall, rising in the air, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+digitalart scifi!!! wallpaper trending on artstation
+concept art of car designed by jony ive, jama jurabaev, science fiction, brush hard, artstation, cgsociety, high quality, brush stroke
+A beautiful oil cartoony painting of a happy Remi Malek riding a tricycle by Lucas Graciano, Frank Frazetta, Greg Rutkowski, Boris Vallejo, epic fantasy character art, high fantasy, Exquisite detail, post-processing, low angle, masterpiece, cinematic
+female priest in white cloak, ultra detailed fantasy, dndbeyond, bright, colourful, realistic, dnd character portrait, full body, pathfinder, pinterest, art by ralph horsley, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, hdr render in unreal engine 5
+a beautiful portrait of death goddess by Greg Rutkowski and Raymond Swanland, ominous background, Trending on Artstation, ultra realistic digital art
+cyborg drug addict, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, artstation, illustration, concept art, smooth, sharp focus, art by John Collier and Albert Aublet and Krenz Cushart and Artem Demura and Alphonse Mucha
+a well designed portrait of viper, detailed, realistic, sketch style, artstation, greg rutkowski, 8 k resolution.
+Moira Stewart as Warhammer 40k Battle Sister, portrait, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+hu tao from genshin impact, hu tao, perfect face, collaborative painting by greg ruthowski, ruan jia, artgerm, highly detailed, complex, exquisite and beautiful, 4 k, 8 k, artstation
+rockstar girl playing electric guitar on stage. by amano yoshitaka, by rembrandt, digital art, digital painting, artstation trending, unreal engine
+beautiful small cyberpunk robot-owl in the deep jungle, with neon color eyes, cinematic view, 8k, ultra realistic, vibrant colors, photo realism, trending artstation, octane render, volumetric lighting, high contrast, intricate, highly detailed, digital painting
+john lennon as jack the ripper, ultra realistic, concept art, intricate details, highly detailed, photorealistic, octane render, 8 k, unreal engine, art by frank frazetta, simon bisley, brom
+lux, from league of legends, au naturel, hyper detailed, digital art, trending in artstation, cinematic lighting, studio quality, smooth render, unreal engine 5 rendered, octane rendered, art style by klimt and nixeu and ian sprigger and wlop and krenz cushart
+Anime art of beautiful Hatsune miku with beautifel legs by artgerm, ross tran, magali villeneuve, Greg Rutkowski, Gil Elvgren, Alberto Vargas, Earl Moran,, Art Frahm, Enoch Bolles
+Daniel Radcliffe wearing a monks tunic holding a glowing fire magical staff. Trending on Artstation, octane render, ultra detailed, art by Ross tran
+an super mega hyper realistic image of a super soldier with a Ukrainian blue and yellow stripes flag standing in the beam of light from the clouds on a pile of skulls as a winner, masculine figure, D&D, fantasy, intricate, elegant, highly detailed, extremely detailed, digital painting, artstation, concept art, matte, sharp focus, symmetrical, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+poster woman with futuristic streetwear and hairstyle, open jacket, cute face, symmetrical face, 3/4 angle, pretty, beautiful, elegant, Anime by Kuvshinov Ilya, Cushart Krentz and Gilleard James, 4k, HDR, Trending on artstation, Behance, Pinterest
+man with fluffy pipidastr, atmosphere, glow, detailed, intricate, full of colour, cinematic lighting, trending on artstation, 4 k, hyperrealistic, focused, extreme details, unreal engine 5, cinematic, masterpiece, moody lighting, by greg rutkowski, wlop, artgerm, trending on artstation, concept art, sharp focus, ray tracing
+a cinematic scene from the cthulhu in pyrrhic victory, concept art by beksinski and jean delville, dramatic lighting, ultra hd, hdr, 8 k
+indistinct glowing prehistoric beasts surrounded by slate grey walls, insane details, dramatic lighting, unreal engine 5, concept art, greg rutkowski, james gurney, johannes voss, hasui kawase.
+a full body portrait of a beautiful post apocalyptic offworld nordic desert snake charmer dancing playfully by the waterfalls, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by krenz cushart and artem demura and alphonse mucha
+dia de los muertos theme poster art by artemio rodriguez, aida muluneh, and gustave bauman, intricate, accurate facial details, profile picture, artgerm, retro, nostalgic, old fashioned, posterized color
+digital artwork, illustration, cinematic camera, a cyborg pilot in the cockpit of a mech, intricate machinery, biomechanics, the ghosts in the machine, cyberpunk concept art by artgerm and Guy Denning and Greg Rutkowski and Ruan Jia, highly detailed, intricate, sci-fi, sharp focus, Trending on Artstation HQ, deviantart
+breathtaking detailed soft painting of a grim reaper with an intricate golden scythe and cloak of fireflies and embers, rembrandt style, detailed art nouveau stained glass of flames background, christian saint rosace, elegant, highly detailed, artstation, concept art, matte, sharp focus, art by Tom Bagshaw, Artgerm and Greg Rutkowski
+portrait of mischievous, enigmatic!!, dangerous youngster Galadriel (Cate Blanchett) as a queen of elves, dressed in a refined silvery garment. The background is a dark, chilling eastern european forrest. night, horroristic shadows, blue tones, higher contrasts, (((lumnious))), theatrical, character concept art by ruan jia, (((thomas kinkade))), and J.Dickenson, trending on Pinterest, ArtStation
+portrait painting of a bloodied serial killer wearing a hello kitty mask, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+An old man trapped in a cave, looking into a mirror, b&w, fantasy art, in the style of masami kurumada, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing
+close-up macro portrait of the face of a beautiful princess with animal skull mask, epic angle and pose, ribcage bones symmetrical artwork, 3d with depth of field, blurred background, cybernetic jellyfish female face skull phoenix bird, translucent, nautilus, energy flows of water and fire. a highly detailed epic cinematic concept art CG render. made in Maya, Blender and Photoshop, octane render, excellent composition, cinematic dystopian brutalist atmosphere, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse. y Greg Rutkowski, Ilya Kuvshinov, WLOP, Stanley Artgerm Lau, Ruan Jia and Fenghua Zhong
+poseidon humanoid god of the sea, trident, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+grainy and distorted xerox of a classified scientific government chart diagram a portal to a higher dimension photorealistic 4k photorealism realistic textures sharpened x-files fringe mystery sci-fi cinematic detailed texture hyperdetailed CIA agency NSA DOD government seal redacted continuous feed paper smooth, sharp focus, illustration, from Metal Gear, Greg Rutkowski and Artgerm artgerm
+martin shkreli in attack on titan, medium shot close up, details, sharp focus, illustration, by jordan grimmer and greg rutkowski, trending artstation, pixiv, digital art
+painting Daft Punk in long coat, elegant, intricate, headshot, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+colleen moore 2 2 years old, bob haircut, portrait painted by stanley artgerm, casting long shadows, resting head on hands, by ross tran
+hyper realistic photography of a stunningly beautiful sphere, self assembly, ribbons, glowing consciences, growing tendrils, hand in the style of beth cavener, jin kagetsu,, and wlop, highly detailed, intricate filigree, symmetry, masterpiece, award winning, sharp focus, concept art, highkey lighting, ambient lighting, octane render, 8 k, artstation
+a photo of larry david playing poker while smoking highly detailed, dim volumetric lighting, 8k, post-processing, soft painting, trending on artstation, concept art, smooth, sharp focus, illustration,by Tom Bagshaw and Daniel Gerhartz and Albert Aublet and Lawrence Alma-Tadema and alphonse mucha
+symmetry!! portrait of space soldier, tech wear, scifi, glowing lights!! intricate elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+portrait of female android, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by fra angelico
+jason fused with a poltergeist lovecraft nervous space demon, spiky skin, creepy, melting, big eyes, photo, portrait, 3 d, high details, intricate details, by vincent di fate, artgerm julie bell beeple, 9 0 s, smooth gradients, volumetric lightning, high contrast, duo tone, depth of field, very coherent symmetrical artwork
+portrait of a fat blue alien. big friendly smile. character concept art. science fiction illustration. close up of the face. key panel art graphic novel. detailed face, beautiful colour palette. digital painting.
+people with posters attacking cops in front a huge blue spiral - shaped white luminous attractor that is floating on the horizon near the sun and stores in los angeles with light screens all over the street, concept art, art for the game, professional lighting, dark night lighting from streetlights
+Lofi cyberpunk portrait beautiful woman with short brown curly hair, roman face, Romanesque, unicorn, rainbow, floral, Pixar style, Tristan Eaton, Stanley Artgerm, Tom Bagshaw
+dog eat dog world , made by Stanley Artgerm Lau, WLOP, Rossdraws, ArtStation, CGSociety, concept art, cgsociety, octane render, trending on artstation, artstationHD, artstationHQ, unreal engine, 4k, 8k,
diff --git a/demo/Diffusion/calibration.py b/demo/Diffusion/calibration.py
new file mode 100644
index 00000000..98adb6d3
--- /dev/null
+++ b/demo/Diffusion/calibration.py
@@ -0,0 +1,177 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import types
+from typing import Callable, Optional, Union
+
+import numpy as np
+import torch
+import torch.distributed as dist
+import torch.nn as nn
+from torch.distributed import ReduceOp
+from utilities import PercentileAmaxes
+
+from ammo.torch.quantization.model_calib import (
+ enable_stats_collection,
+ finish_stats_collection,
+ max_calibrate,
+)
+from ammo.torch.quantization.utils import is_quantized_linear
+
+
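+# Percentile-based amax calibration: the overridden compute_amax returns the element-wise
+# minimum over the first `total_step * percentile` entries of the tracked amax history
+# (falling back to the calibrator's own amax when history is not tracked).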
+def precentile_calib_mode(base_unet, quant_config={}):
+ def compute_amax(self, all_reduce=True):
+ """Return the absolute max of all tensors collected."""
+ if (
+ self._calib_amax is not None
+ and all_reduce
+ and dist.is_available()
+ and dist.is_initialized()
+ and dist.get_world_size() > 1
+ ):
+ tmp_amax = self._calib_amax.clone()
+ dist.all_reduce(tmp_amax, op=ReduceOp.MAX)
+ self._calib_amax.copy_(tmp_amax)
+ if self._track_amax:
+ up_lim = int(self._amaxs.total_step * self._amaxs.percentile)
+ if up_lim <= 0:
+ up_lim = 1
+ amaxs_values = [self._amaxs.data[i] for i in range(0, up_lim)]
+ act_amax = (
+ torch.tensor(np.vstack(amaxs_values).min(axis=0))
+ .float()
+ .squeeze(0)
+ .to(self._calib_amax.device)
+ .to(self._calib_amax.dtype)
+ )
+ return act_amax
+ return self._calib_amax
+
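+    # Attach a PercentileAmaxes history tracker to every Linear/Conv2d input quantizer
+    # and bind the percentile-aware compute_amax override to its calibrator.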
+ for _, module in base_unet.named_modules():
+ if isinstance(module, (nn.Linear, nn.Conv2d)):
+ module.input_quantizer._calibrator._track_amax = True
+ module.input_quantizer._calibrator._amaxs = PercentileAmaxes(
+ total_step=quant_config["base-step"], percentile=quant_config["percentile"]
+ )
+ module.input_quantizer._calibrator.compute_amax = types.MethodType(
+ compute_amax, module.input_quantizer._calibrator
+ )
+
+
+@torch.no_grad()
+def smoothquant(model, forward_loop=None):
+    """
+    Rewrite of the original SmoothQuant calibration method: run max calibration first,
+    then fold a per-channel smoothing scale into the weights of each quantized Linear module.
+    """
+ assert forward_loop is not None, "forward_loop must be provided for smoothquant"
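+    # First pass: standard max calibration to collect activation amax statistics.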
+ max_calibrate(model, forward_loop)
+
+ smoothed_modules = 0
+ for name, module in model.named_modules():
+ if is_quantized_linear(module):
+ if not hasattr(module.input_quantizer, "_amax"):
+                print(f"Warning: {name} is not calibrated, skipping smoothing")
+ continue
+ if module.input_quantizer.num_bits != 8 or module.weight_quantizer.num_bits != 8:
+                print(f"Warning: only int8 smoothing is supported, skipping {name}")
+ continue
+ if module.input_quantizer.axis != -1:
+                print(f"Warning: only per-channel smoothing is supported, skipping {name}")
+ continue
+
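+            # SmoothQuant migration strength: larger alpha shifts more of the quantization
+            # difficulty from activations to weights; reg_alpha_qkv can override it per module.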
+ alpha = 1.0
+ if hasattr(module, "alpha"):
+ alpha = module.alpha
+ assert (
+ module.input_quantizer._amax.numel() > 1
+ ), f"Error: {name} has only one channel to smooth"
+
+ # It is important to keep scaling math in fp32 to be numerically safe
+ act_amax = module.input_quantizer.amax.float()
+
+ act_device = act_amax.device
+
+            # If the model is split across devices, this tensor may be on the wrong one
+ act_amax = act_amax.to(module.weight.device)
+
+ weight_scale = module.weight.abs().max(dim=0, keepdim=True)[0]
+ scale_a = (weight_scale.pow(1 - alpha) / act_amax.pow(alpha)).squeeze()
+
+            # Some channels could have a 0 amax, which causes scale_a to overflow. Explicitly mask them out here
+ epsilon = 1.0 / (1 << 31)
+ if act_amax.min() <= epsilon:
+ zero_mask = act_amax <= epsilon
+ scale_a[zero_mask] = 1
+ inv_scale_a = 1.0 / scale_a
+ inv_scale_a = inv_scale_a.squeeze()[None, :]
+
+ # Use per-tensor quantization for activation, add a pre-quantization scale vector
+ module.input_quantizer.pre_quant_scale = scale_a.to(module.weight.dtype).to(act_device)
+ module.input_quantizer._axis = None
+ delattr(module.input_quantizer, "_amax")
+ module.input_quantizer.amax = torch.tensor(
+ (act_amax * scale_a).max().item(),
+ dtype=module.weight.dtype,
+ device=module.weight.device,
+ )
+
+ # Multiply weight by inv_scale_a and recalibrate
+ module.weight.detach().copy_(
+ (module.weight.float() * inv_scale_a).to(module.weight.dtype)
+ )
+
+ enable_stats_collection(module.weight_quantizer)
+ module.weight_quantizer(module.weight)
+ finish_stats_collection(module.weight_quantizer)
+
+ smoothed_modules += 1
+ print(f"Smoothed {smoothed_modules} modules")
+
+
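+# Dispatch helper: pick the calibration algorithm ("max" or "smoothquant") and run it
+# using the provided forward loop over calibration data.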
+def calibrate(
+ model: nn.Module,
+ algorithm: Union[str, dict, None] = "max",
+ forward_loop: Optional[Callable] = None,
+) -> None:
+ if algorithm is None:
+ return
+
+ if isinstance(algorithm, str):
+ kwargs = {}
+ elif isinstance(algorithm, dict):
+ kwargs = algorithm.copy()
+ algorithm = kwargs.pop("method")
+ else:
+ raise TypeError(f"Unsupported type for algorithm: {type(algorithm)}")
+
+ if algorithm == "smoothquant":
+ smoothquant(model, forward_loop)
+ elif algorithm == "max":
+ max_calibrate(model, forward_loop)
+ else:
+ raise ValueError(f"Unsupported calibration algorithm: {algorithm}")
+
+
+def reg_alpha_qkv(base_unet, alpha):
+    """
+    Apply the SmoothQuant alpha override only to attention QKV projection layers
+    (Linear modules whose names contain to_q, to_k or to_v).
+    """
+ for name, module in base_unet.named_modules():
+ if isinstance(module, torch.nn.Linear):
+ if "to_q" in name or "to_k" in name or "to_v" in name:
+ module.alpha = alpha
+
diff --git a/demo/Diffusion/demo_controlnet.py b/demo/Diffusion/demo_controlnet.py
new file mode 100644
index 00000000..a730935d
--- /dev/null
+++ b/demo/Diffusion/demo_controlnet.py
@@ -0,0 +1,123 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import argparse
+
+import controlnet_aux
+import torch
+from cuda import cudart
+from PIL import Image
+
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import PIPELINE_TYPE, TRT_LOGGER, add_arguments, download_image, process_pipeline_args
+
+def parseArgs():
+ parser = argparse.ArgumentParser(description="Options for Stable Diffusion ControlNet Demo", conflict_handler='resolve')
+ parser = add_arguments(parser)
+ parser.add_argument('--scheduler', type=str, default="UniPC", choices=["DDIM", "DPM", "EulerA", "LMSD", "PNDM", "UniPC"], help="Scheduler for diffusion process")
+    parser.add_argument('--input-image', nargs='+', type=str, default=[], help="Path to the input image(s) already prepared for the ControlNet modality, e.g. a Canny edge map for the Canny ControlNet rather than a regular RGB image")
+ parser.add_argument('--controlnet-type', nargs='+', type=str, default=["canny"], help="Controlnet type, can be `None`, `str` or `str` list from ['canny', 'depth', 'hed', 'mlsd', 'normal', 'openpose', 'scribble', 'seg']")
+ parser.add_argument('--controlnet-scale', nargs='+', type=float, default=[1.0], help="The outputs of the controlnet are multiplied by `controlnet_scale` before they are added to the residual in the original unet, can be `None`, `float` or `float` list")
+ return parser.parse_args()
+
+if __name__ == "__main__":
+ print("[I] Initializing StableDiffusion controlnet demo using TensorRT")
+ args = parseArgs()
+
+ # Controlnet configuration
+ if not isinstance(args.controlnet_type, list):
+ raise ValueError(f"`--controlnet-type` must be of type `str` or `str` list, but is {type(args.controlnet_type)}")
+
+    # ControlNet scale configuration
+    if not isinstance(args.controlnet_scale, list):
+        raise ValueError(f"`--controlnet-scale` must be of type `float` or `float` list, but is {type(args.controlnet_scale)}")
+
+    # Check that the number of ControlNets matches the number of ControlNet scales
+    if len(args.controlnet_type) != len(args.controlnet_scale):
+        raise ValueError(f"Number of ControlNets ({len(args.controlnet_type)}) must equal the number of ControlNet scales ({len(args.controlnet_scale)}).")
+
+ # Convert controlnet scales to tensor
+ controlnet_scale = torch.FloatTensor(args.controlnet_scale)
+
+ # Check images
+ input_images = []
+ if len(args.input_image) > 0:
+ for image in args.input_image:
+ input_images.append(Image.open(image))
+ else:
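+        # No conditioning images were provided: download a sample image for each requested
+        # ControlNet type and run the matching controlnet_aux preprocessor on it.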
+ for controlnet in args.controlnet_type:
+ if controlnet == "canny":
+ canny_image = download_image("https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png")
+ canny_image = controlnet_aux.CannyDetector()(canny_image)
+ input_images.append(canny_image.resize((args.height, args.width)))
+ elif controlnet == "normal":
+ normal_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-normal/resolve/main/images/toy.png")
+ normal_image = controlnet_aux.NormalBaeDetector.from_pretrained("lllyasviel/Annotators")(normal_image)
+ input_images.append(normal_image.resize((args.height, args.width)))
+ elif controlnet == "depth":
+ depth_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-depth/resolve/main/images/stormtrooper.png")
+ depth_image = controlnet_aux.LeresDetector.from_pretrained("lllyasviel/Annotators")(depth_image)
+ input_images.append(depth_image.resize((args.height, args.width)))
+ elif controlnet == "hed":
+ hed_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-hed/resolve/main/images/man.png")
+ hed_image = controlnet_aux.HEDdetector.from_pretrained("lllyasviel/Annotators")(hed_image)
+ input_images.append(hed_image.resize((args.height, args.width)))
+ elif controlnet == "mlsd":
+ mlsd_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-mlsd/resolve/main/images/room.png")
+ mlsd_image = controlnet_aux.MLSDdetector.from_pretrained("lllyasviel/Annotators")(mlsd_image)
+ input_images.append(mlsd_image.resize((args.height, args.width)))
+ elif controlnet == "openpose":
+ openpose_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-openpose/resolve/main/images/pose.png")
+ openpose_image = controlnet_aux.OpenposeDetector.from_pretrained("lllyasviel/Annotators")(openpose_image)
+ input_images.append(openpose_image.resize((args.height, args.width)))
+ elif controlnet == "scribble":
+ scribble_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-scribble/resolve/main/images/bag.png")
+ scribble_image = controlnet_aux.HEDdetector.from_pretrained("lllyasviel/Annotators")(scribble_image, scribble=True)
+ input_images.append(scribble_image.resize((args.height, args.width)))
+ elif controlnet == "seg":
+ seg_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-seg/resolve/main/images/house.png")
+ seg_image = controlnet_aux.SamDetector.from_pretrained("ybelkada/segment-anything", subfolder="checkpoints")(seg_image)
+ input_images.append(seg_image.resize((args.height, args.width)))
+ else:
+                raise ValueError(f"Conditioning image preparation is not implemented for ControlNet type: {controlnet}")
+ assert len(input_images) > 0
+
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
+
+ # Initialize demo
+ demo = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.CONTROLNET,
+ controlnets=args.controlnet_type,
+ **kwargs_init_pipeline)
+
+ # Load TensorRT engines and pytorch modules
+ demo.loadEngines(
+ args.engine_dir,
+ args.framework_model_dir,
+ args.onnx_dir,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.calculateMaxDeviceMemory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ demo_kwargs = {'input_image': input_images, 'controlnet_scales': controlnet_scale}
+ demo.run(*args_run_demo, **demo_kwargs)
+
+ demo.teardown()
diff --git a/demo/Diffusion/demo_img2img.py b/demo/Diffusion/demo_img2img.py
index 963babee..bf56f6a9 100755
--- a/demo/Diffusion/demo_img2img.py
+++ b/demo/Diffusion/demo_img2img.py
@@ -16,19 +16,24 @@
#
import argparse
-from cuda import cudart
-import tensorrt as trt
-
-from img2img_pipeline import Img2ImgPipeline
-from utilities import preprocess_image, TRT_LOGGER, add_arguments, download_image
import PIL
+from cuda import cudart
from PIL import Image
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import (
+ PIPELINE_TYPE,
+ TRT_LOGGER,
+ add_arguments,
+ download_image,
+ preprocess_image,
+ process_pipeline_args
+)
+
def parseArgs():
parser = argparse.ArgumentParser(description="Options for Stable Diffusion Img2Img Demo")
parser = add_arguments(parser)
- parser.add_argument('--scheduler', type=str, default="DDIM", choices=["DDIM", "EulerA", "LMSD", "DPM", "PNDM"], help="Scheduler for diffusion process")
parser.add_argument('--input-image', type=str, default="", help="Path to the input image")
return parser.parse_args()
@@ -36,81 +41,42 @@ def parseArgs():
print("[I] Initializing StableDiffusion img2img demo using TensorRT")
args = parseArgs()
- # Process prompt
- if not isinstance(args.prompt, list):
- raise ValueError(f"`prompt` must be of type `str` or `str` list, but is {type(args.prompt)}")
- prompt = args.prompt * args.repeat_prompt
-
- if not isinstance(args.negative_prompt, list):
- raise ValueError(f"`--negative-prompt` must be of type `str` or `str` list, but is {type(args.negative_prompt)}")
- if len(args.negative_prompt) == 1:
- negative_prompt = args.negative_prompt * len(prompt)
- else:
- negative_prompt = args.negative_prompt
-
if args.input_image:
input_image = Image.open(args.input_image)
else:
- url = "https://pajoca.com/wp-content/uploads/2022/09/tekito-yamakawa-1.png"
+ url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
input_image = download_image(url)
image_width, image_height = input_image.size
-
- # Validate image dimensions
- if image_height % 8 != 0 or image_width % 8 != 0:
- raise ValueError(f"Image height and width have to be divisible by 8 but specified as: {image_height} and {image_width}.")
+ if image_height != args.height or image_width != args.width:
+ print(f"[I] Resizing input_image to {args.height}x{args.width}")
+ input_image = input_image.resize((args.height, args.width))
+ image_height, image_width = args.height, args.width
if isinstance(input_image, PIL.Image.Image):
input_image = preprocess_image(input_image)
- # Register TensorRT plugins
- trt.init_libnvinfer_plugins(TRT_LOGGER, '')
-
- max_batch_size = 16
- if args.build_dynamic_shape:
- max_batch_size = 4
-
- batch_size = len(prompt)
- if batch_size > max_batch_size:
- raise ValueError(f"Batch size {len(prompt)} is larger than allowed {max_batch_size}. If dynamic shape is used, then maximum batch size is 4")
-
- if args.use_cuda_graph and (not args.build_static_batch or args.build_dynamic_shape):
- raise ValueError(f"Using CUDA graph requires static dimensions. Enable `--build-static-batch` and do not specify `--build-dynamic-shape`")
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
# Initialize demo
- demo = Img2ImgPipeline(
- scheduler=args.scheduler,
- denoising_steps=args.denoising_steps,
- output_dir=args.output_dir,
- version=args.version,
- hf_token=args.hf_token,
- verbose=args.verbose,
- nvtx_profile=args.nvtx_profile,
- max_batch_size=max_batch_size)
+ demo = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.IMG2IMG,
+ **kwargs_init_pipeline)
# Load TensorRT engines and pytorch modules
- demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,
- opt_batch_size=len(prompt), opt_image_height=image_height, opt_image_width=image_width, \
- force_export=args.force_onnx_export, force_optimize=args.force_onnx_optimize, \
- force_build=args.force_engine_build, \
- static_batch=args.build_static_batch, static_shape=not args.build_dynamic_shape, \
- enable_refit=args.build_enable_refit, enable_preview=args.build_preview_features, enable_all_tactics=args.build_all_tactics, \
- timing_cache=args.timing_cache, onnx_refit_dir=args.onnx_refit_dir)
- demo.loadResources(image_height, image_width, batch_size, args.seed)
-
- if args.use_cuda_graph:
- # inference once to get cuda graph
- images = demo.infer(prompt, negative_prompt, input_image, image_height, image_width, strength=0.75, warmup=True)
-
- print("[I] Warming up ..")
- for _ in range(args.num_warmup_runs):
- images = demo.infer(prompt, negative_prompt, input_image, image_height, image_width, strength=0.75, warmup=True)
-
- print("[I] Running StableDiffusion pipeline")
- if args.nvtx_profile:
- cudart.cudaProfilerStart()
-
- images = demo.infer(prompt, negative_prompt, input_image, image_height, image_width, seed=args.seed, strength=0.75)
-
- if args.nvtx_profile:
- cudart.cudaProfilerStop()
+ demo.loadEngines(
+ args.engine_dir,
+ args.framework_model_dir,
+ args.onnx_dir,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.calculateMaxDeviceMemory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ demo_kwargs = {'input_image': input_image, 'image_strength': 0.75}
+ demo.run(*args_run_demo, **demo_kwargs)
+
+ demo.teardown()
diff --git a/demo/Diffusion/demo_inpaint.py b/demo/Diffusion/demo_inpaint.py
index 1fa8219a..af635df0 100755
--- a/demo/Diffusion/demo_inpaint.py
+++ b/demo/Diffusion/demo_inpaint.py
@@ -16,15 +16,17 @@
#
import argparse
+
from cuda import cudart
-import tensorrt as trt
-from utilities import TRT_LOGGER, add_arguments, download_image
-from inpaint_pipeline import InpaintPipeline
from PIL import Image
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import PIPELINE_TYPE, TRT_LOGGER, add_arguments, download_image, process_pipeline_args
+
def parseArgs():
- parser = argparse.ArgumentParser(description="Options for Stable Diffusion Inpaint Demo")
+ parser = argparse.ArgumentParser(description="Options for Stable Diffusion Inpaint Demo", conflict_handler='resolve')
parser = add_arguments(parser)
+ parser.add_argument('--version', type=str, default="1.5", choices=["1.5", "2.0"], help="Stable Diffusion version. Only 1.5 and 2.0 supported for inpainting.")
parser.add_argument('--scheduler', type=str, default="PNDM", choices=["PNDM"], help="Scheduler for diffusion process")
parser.add_argument('--input-image', type=str, default="", help="Path to the input image")
parser.add_argument('--mask-image', type=str, default="", help="Path to the mask image")
@@ -34,22 +36,6 @@ def parseArgs():
print("[I] Initializing StableDiffusion inpainting demo using TensorRT")
args = parseArgs()
- # Inpainting is currently only supported for v1.5 and v2.0
- if args.version not in ("1.5", "2.0"):
- raise ValueError(f"Inpainting not supported in version {args.version}. Use v2.0, or v1.5")
-
- # Process prompt
- if not isinstance(args.prompt, list):
- raise ValueError(f"`prompt` must be of type `str` or `str` list, but is {type(args.prompt)}")
- prompt = args.prompt * args.repeat_prompt
-
- if not isinstance(args.negative_prompt, list):
- raise ValueError(f"`--negative-prompt` must be of type `str` or `str` list, but is {type(args.negative_prompt)}")
- if len(args.negative_prompt) == 1:
- negative_prompt = args.negative_prompt * len(prompt)
- else:
- negative_prompt = args.negative_prompt
-
if args.input_image:
input_image = Image.open(args.input_image).convert("RGB")
else:
@@ -63,65 +49,38 @@ def parseArgs():
mask_image = download_image(mask_url)
image_width, image_height = input_image.size
- mask_width, mask_height = mask_image.size
-
- # Validate image dimensions
- if mask_height != image_height or mask_width != image_width:
- raise ValueError(f"Input image height and width {image_height} and {image_width} are not equal to "
- f"the respective dimensions of the mask image {mask_height} and {mask_width}")
-
- if image_height % 8 != 0 or image_width % 8 != 0:
- raise ValueError(f"Image height and width have to be divisible by 8 but specified as: {image_height} and {image_width}.")
+ if image_height != args.height or image_width != args.width:
+ print(f"[I] Resizing input_image to {args.height}x{args.width}")
+ input_image = input_image.resize((args.height, args.width))
+ image_height, image_width = args.height, args.width
- # Register TensorRT plugins
- trt.init_libnvinfer_plugins(TRT_LOGGER, '')
-
- max_batch_size = 16
- if args.build_dynamic_shape:
- max_batch_size = 4
-
- batch_size = len(prompt)
- if batch_size > max_batch_size:
- raise ValueError(f"Batch size {len(prompt)} is larger than allowed {max_batch_size}. If dynamic shape is used, then maximum batch size is 4")
+ mask_width, mask_height = mask_image.size
+ if mask_height != args.height or mask_width != args.width:
+ print(f"[I] Resizing mask_image to {args.height}x{args.width}")
+ mask_image = mask_image.resize((args.height, args.width))
+ mask_height, mask_width = args.height, args.width
- if args.use_cuda_graph and (not args.build_static_batch or args.build_dynamic_shape):
- raise ValueError(f"Using CUDA graph requires static dimensions. Enable `--build-static-batch` and do not specify `--build-dynamic-shape`")
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
# Initialize demo
- demo = InpaintPipeline(
- scheduler=args.scheduler,
- denoising_steps=args.denoising_steps,
- output_dir=args.output_dir,
- version=args.version,
- hf_token=args.hf_token,
- verbose=args.verbose,
- nvtx_profile=args.nvtx_profile,
- max_batch_size=max_batch_size)
+ demo = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.INPAINT,
+ **kwargs_init_pipeline)
# Load TensorRT engines and pytorch modules
- demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,
- opt_batch_size=len(prompt), opt_image_height=image_height, opt_image_width=image_width, \
- force_export=args.force_onnx_export, force_optimize=args.force_onnx_optimize, \
- force_build=args.force_engine_build, \
- static_batch=args.build_static_batch, static_shape=not args.build_dynamic_shape, \
- enable_preview=args.build_preview_features, enable_all_tactics=args.build_all_tactics, \
- timing_cache=args.timing_cache)
- demo.loadResources(image_height, image_width, batch_size, args.seed)
-
-
- if args.use_cuda_graph:
- # inference once to get cuda graph
- images = demo.infer(prompt, negative_prompt, input_image, mask_image, image_height, image_width, strength=0.75, warmup=True)
-
- print("[I] Warming up ..")
- for _ in range(args.num_warmup_runs):
- images = demo.infer(prompt, negative_prompt, input_image, mask_image, image_height, image_width, strength=0.75, warmup=True)
-
- print("[I] Running StableDiffusion pipeline")
- if args.nvtx_profile:
- cudart.cudaProfilerStart()
-
- images = demo.infer(prompt, negative_prompt, input_image, mask_image, image_height, image_width, seed=args.seed, strength=0.75)
-
- if args.nvtx_profile:
- cudart.cudaProfilerStop()
+ demo.loadEngines(
+ args.engine_dir,
+ args.framework_model_dir,
+ args.onnx_dir,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.calculateMaxDeviceMemory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ demo_kwargs = {'input_image': input_image, 'image_strength': 0.75, 'mask_image': mask_image}
+ demo.run(*args_run_demo, **demo_kwargs)
+
+ demo.teardown()
diff --git a/demo/Diffusion/demo_txt2img.py b/demo/Diffusion/demo_txt2img.py
index 4491c45e..3e33838f 100644
--- a/demo/Diffusion/demo_txt2img.py
+++ b/demo/Diffusion/demo_txt2img.py
@@ -16,89 +16,41 @@
#
import argparse
+
from cuda import cudart
-import tensorrt as trt
-from utilities import TRT_LOGGER, add_arguments
-from txt2img_pipeline import Txt2ImgPipeline
+
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import PIPELINE_TYPE, TRT_LOGGER, add_arguments, process_pipeline_args
def parseArgs():
parser = argparse.ArgumentParser(description="Options for Stable Diffusion Txt2Img Demo")
parser = add_arguments(parser)
- parser.add_argument('--scheduler', type=str, default="DDIM", choices=["PNDM", "LMSD", "DPM", "DDIM", "EulerA"], help="Scheduler for diffusion process")
return parser.parse_args()
if __name__ == "__main__":
print("[I] Initializing StableDiffusion txt2img demo using TensorRT")
args = parseArgs()
- # Process prompt
- if not isinstance(args.prompt, list):
- raise ValueError(f"`prompt` must be of type `str` or `str` list, but is {type(args.prompt)}")
- prompt = args.prompt * args.repeat_prompt
-
- if not isinstance(args.negative_prompt, list):
- raise ValueError(f"`--negative-prompt` must be of type `str` or `str` list, but is {type(args.negative_prompt)}")
- if len(args.negative_prompt) == 1:
- negative_prompt = args.negative_prompt * len(prompt)
- else:
- negative_prompt = args.negative_prompt
-
- # Validate image dimensions
- image_height = args.height
- image_width = args.width
- if image_height % 8 != 0 or image_width % 8 != 0:
- raise ValueError(f"Image height and width have to be divisible by 8 but specified as: {image_height} and {image_width}.")
-
- # Register TensorRT plugins
- trt.init_libnvinfer_plugins(TRT_LOGGER, '')
-
- max_batch_size = 16
- # FIXME VAE build fails due to element limit. Limitting batch size is WAR
- if args.build_dynamic_shape or image_height > 512 or image_width > 512:
- max_batch_size = 4
-
- batch_size = len(prompt)
- if batch_size > max_batch_size:
- raise ValueError(f"Batch size {len(prompt)} is larger than allowed {max_batch_size}. If dynamic shape is used, then maximum batch size is 4")
-
- if args.use_cuda_graph and (not args.build_static_batch or args.build_dynamic_shape):
- raise ValueError(f"Using CUDA graph requires static dimensions. Enable `--build-static-batch` and do not specify `--build-dynamic-shape`")
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
# Initialize demo
- demo = Txt2ImgPipeline(
- scheduler=args.scheduler,
- denoising_steps=args.denoising_steps,
- output_dir=args.output_dir,
- version=args.version,
- hf_token=args.hf_token,
- verbose=args.verbose,
- nvtx_profile=args.nvtx_profile,
- max_batch_size=max_batch_size,
- use_cuda_graph=args.use_cuda_graph)
+ demo = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.TXT2IMG,
+ **kwargs_init_pipeline)
# Load TensorRT engines and pytorch modules
- demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,
- opt_batch_size=len(prompt), opt_image_height=image_height, opt_image_width=image_width, \
- force_export=args.force_onnx_export, force_optimize=args.force_onnx_optimize, \
- force_build=args.force_engine_build, \
- static_batch=args.build_static_batch, static_shape=not args.build_dynamic_shape, \
- enable_refit=args.build_enable_refit, enable_preview=args.build_preview_features, enable_all_tactics=args.build_all_tactics, \
- timing_cache=args.timing_cache, onnx_refit_dir=args.onnx_refit_dir)
- demo.loadResources(image_height, image_width, batch_size, args.seed)
-
- if args.use_cuda_graph:
- # inference once to get cuda graph
- images = demo.infer(prompt, negative_prompt, image_height, image_width, warmup=True, verbose=False)
-
- print("[I] Warming up ..")
- for _ in range(args.num_warmup_runs):
- images = demo.infer(prompt, negative_prompt, image_height, image_width, warmup=True, verbose=False)
-
- print("[I] Running StableDiffusion pipeline")
- if args.nvtx_profile:
- cudart.cudaProfilerStart()
- images = demo.infer(prompt, negative_prompt, image_height, image_width, seed=args.seed, verbose=args.verbose)
- if args.nvtx_profile:
- cudart.cudaProfilerStop()
+ demo.loadEngines(
+ args.engine_dir,
+ args.framework_model_dir,
+ args.onnx_dir,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.calculateMaxDeviceMemory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ demo.run(*args_run_demo)
demo.teardown()
diff --git a/demo/Diffusion/demo_txt2img_xl.py b/demo/Diffusion/demo_txt2img_xl.py
new file mode 100644
index 00000000..ea579279
--- /dev/null
+++ b/demo/Diffusion/demo_txt2img_xl.py
@@ -0,0 +1,151 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import argparse
+
+from cuda import cudart
+
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import PIPELINE_TYPE, TRT_LOGGER, add_arguments, process_pipeline_args
+
+def parseArgs():
+ parser = argparse.ArgumentParser(description="Options for Stable Diffusion XL Txt2Img Demo", conflict_handler='resolve')
+ parser = add_arguments(parser)
+ parser.add_argument('--version', type=str, default="xl-1.0", choices=["xl-1.0", "xl-turbo"], help="Version of Stable Diffusion XL")
+ parser.add_argument('--height', type=int, default=1024, help="Height of image to generate (must be multiple of 8)")
+    parser.add_argument('--width', type=int, default=1024, help="Width of image to generate (must be multiple of 8)")
+ parser.add_argument('--num-warmup-runs', type=int, default=1, help="Number of warmup runs before benchmarking performance")
+
+ parser.add_argument('--guidance-scale', type=float, default=5.0, help="Value of classifier-free guidance scale (must be greater than 1)")
+
+ parser.add_argument('--enable-refiner', action='store_true', help="Enable SDXL-Refiner model")
+ parser.add_argument('--image-strength', type=float, default=0.3, help="Strength of transformation applied to input_image (must be between 0 and 1)")
+ parser.add_argument('--onnx-refiner-dir', default='onnx_xl_refiner', help="Directory for SDXL-Refiner ONNX models")
+ parser.add_argument('--engine-refiner-dir', default='engine_xl_refiner', help="Directory for SDXL-Refiner TensorRT engines")
+
+ return parser.parse_args()
+
+class StableDiffusionXLPipeline(StableDiffusionPipeline):
+ def __init__(self, vae_scaling_factor=0.13025, enable_refiner=False, **kwargs):
+ self.enable_refiner = enable_refiner
+ self.nvtx_profile = kwargs['nvtx_profile']
+ self.base = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.XL_BASE,
+ vae_scaling_factor=vae_scaling_factor,
+ return_latents=self.enable_refiner,
+ **kwargs)
+ if self.enable_refiner:
+ self.refiner = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.XL_REFINER,
+ vae_scaling_factor=vae_scaling_factor,
+ return_latents=False,
+ **kwargs)
+
+ def loadEngines(self, framework_model_dir, onnx_dir, engine_dir, onnx_refiner_dir='onnx_xl_refiner', engine_refiner_dir='engine_xl_refiner', **kwargs):
+ self.base.loadEngines(engine_dir, framework_model_dir, onnx_dir, **kwargs)
+ if self.enable_refiner:
+ self.refiner.loadEngines(engine_refiner_dir, framework_model_dir, onnx_refiner_dir, **kwargs)
+
+ def activateEngines(self, shared_device_memory=None):
+ self.base.activateEngines(shared_device_memory)
+ if self.enable_refiner:
+ self.refiner.activateEngines(shared_device_memory)
+
+ def loadResources(self, image_height, image_width, batch_size, seed):
+ self.base.loadResources(image_height, image_width, batch_size, seed)
+ if self.enable_refiner:
+ # Use a different seed for refiner - we arbitrarily use base seed+1, if specified.
+ self.refiner.loadResources(image_height, image_width, batch_size, ((seed+1) if seed is not None else None))
+
+ def get_max_device_memory(self):
+ max_device_memory = self.base.calculateMaxDeviceMemory()
+ if self.enable_refiner:
+ max_device_memory = max(max_device_memory, self.refiner.calculateMaxDeviceMemory())
+ return max_device_memory
+
+ def run(self, prompt, negative_prompt, height, width, batch_size, batch_count, num_warmup_runs, use_cuda_graph, **kwargs_infer_refiner):
+ # Process prompt
+ if not isinstance(prompt, list):
+ raise ValueError(f"`prompt` must be of type `str` list, but is {type(prompt)}")
+ prompt = prompt * batch_size
+
+ if not isinstance(negative_prompt, list):
+ raise ValueError(f"`--negative-prompt` must be of type `str` list, but is {type(negative_prompt)}")
+ if len(negative_prompt) == 1:
+ negative_prompt = negative_prompt * batch_size
+
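+        # The CUDA graph is captured during warmup, so force at least one warmup iteration when CUDA graph mode is enabled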
+ num_warmup_runs = max(1, num_warmup_runs) if use_cuda_graph else num_warmup_runs
+ if num_warmup_runs > 0:
+ print("[I] Warming up ..")
+ for _ in range(num_warmup_runs):
+ images, _ = self.base.infer(prompt, negative_prompt, height, width, warmup=True)
+                if self.enable_refiner:
+ images, _ = self.refiner.infer(prompt, negative_prompt, height, width, input_image=images, warmup=True, **kwargs_infer_refiner)
+
+ ret = []
+ for _ in range(batch_count):
+ print("[I] Running StableDiffusionXL pipeline")
+ if self.nvtx_profile:
+ cudart.cudaProfilerStart()
+ latents, time_base = self.base.infer(prompt, negative_prompt, height, width, warmup=False)
+ if self.enable_refiner:
+ images, time_refiner = self.refiner.infer(prompt, negative_prompt, height, width, input_image=latents, warmup=False, **kwargs_infer_refiner)
+ ret.append(images)
+ else:
+ ret.append(latents)
+
+ if self.nvtx_profile:
+ cudart.cudaProfilerStop()
+ if self.enable_refiner:
+ print('|-----------------|--------------|')
+ print('| {:^15} | {:>9.2f} ms |'.format('e2e', time_base + time_refiner))
+ print('|-----------------|--------------|')
+ return ret
+
+ def teardown(self):
+ self.base.teardown()
+ if self.enable_refiner:
+ self.refiner.teardown()
+
+if __name__ == "__main__":
+ print("[I] Initializing TensorRT accelerated StableDiffusionXL txt2img pipeline")
+ args = parseArgs()
+
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
+
+ # Initialize demo
+ demo = StableDiffusionXLPipeline(vae_scaling_factor=0.13025, enable_refiner=args.enable_refiner, **kwargs_init_pipeline)
+
+ # Load TensorRT engines and pytorch modules
+ kwargs_load_refiner = {'onnx_refiner_dir': args.onnx_refiner_dir, 'engine_refiner_dir': args.engine_refiner_dir} if args.enable_refiner else {}
+ demo.loadEngines(
+ args.framework_model_dir,
+ args.onnx_dir,
+ args.engine_dir,
+ **kwargs_load_refiner,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.get_max_device_memory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ kwargs_infer_refiner = {'image_strength': args.image_strength} if args.enable_refiner else {}
+ demo.run(*args_run_demo, **kwargs_infer_refiner)
+
+ demo.teardown()
diff --git a/demo/Diffusion/img2img_pipeline.py b/demo/Diffusion/img2img_pipeline.py
deleted file mode 100755
index 2a0b05d1..00000000
--- a/demo/Diffusion/img2img_pipeline.py
+++ /dev/null
@@ -1,115 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import numpy as np
-import nvtx
-import time
-import torch
-import tensorrt as trt
-from utilities import TRT_LOGGER
-from stable_diffusion_pipeline import StableDiffusionPipeline
-
-class Img2ImgPipeline(StableDiffusionPipeline):
- """
- Application showcasing the acceleration of Stable Diffusion Img2Img v1.4, v1.5, v2.0-base, v2.0, v2.1-base, v2.1 pipeline using NVidia TensorRT w/ Plugins.
- """
- def __init__(
- self,
- scheduler="DDIM",
- *args, **kwargs
- ):
- """
- Initializes the Img2Img Diffusion pipeline.
-
- Args:
- scheduler (str):
- The scheduler to guide the denoising process. Must be one of the [EulerA, DDIM, DPM, LMSD, PNDM].
- """
- super(Img2ImgPipeline, self).__init__(*args, **kwargs, \
- scheduler=scheduler, stages=['vae_encoder', 'clip', 'unet', 'vae'])
-
- def infer(
- self,
- prompt,
- negative_prompt,
- init_image,
- image_height,
- image_width,
- seed=None,
- strength=0.75,
- warmup=False,
- verbose=False
- ):
- """
- Run the diffusion pipeline.
-
- Args:
- prompt (str):
- The text prompt to guide image generation.
- negative_prompt (str):
- The prompt not to guide the image generation.
- init_image (image):
- Input image to be used as input.
- image_height (int):
- Height (in pixels) of the image to be generated. Must be a multiple of 8.
- image_width (int):
- Width (in pixels) of the image to be generated. Must be a multiple of 8.
- seed (int):
- Seed for the random generator
- strength (float):
- How much to transform the input image. Must be between 0 and 1
- warmup (bool):
- Indicate if this is a warmup run.
- verbose (bool):
- Verbose in logging
- """
- batch_size = len(prompt)
- assert len(prompt) == len(negative_prompt)
-
- with torch.inference_mode(), torch.autocast("cuda"), trt.Runtime(TRT_LOGGER):
- torch.cuda.synchronize()
- e2e_tic = time.perf_counter()
-
- # Initialize timesteps
- timesteps, t_start = self.initialize_timesteps(self.denoising_steps, strength)
- latent_timestep = timesteps[:1].repeat(batch_size)
-
- # Pre-process input image
- init_image = self.preprocess_images(batch_size, (init_image,))[0]
-
- # VAE encode init image
- init_latents = self.encode_image(init_image)
-
- # CLIP text encoder
- text_embeddings = self.encode_prompt(prompt, negative_prompt)
-
- # Add noise to latents using timesteps
- noise = torch.randn(init_latents.shape, generator=self.generator, device=self.device, dtype=torch.float32)
- latents = self.scheduler.add_noise(init_latents, noise, t_start, latent_timestep)
-
- # UNet denoiser
- latents = self.denoise_latent(latents, text_embeddings, timesteps=timesteps, step_offset=t_start)
-
- # VAE decode latent
- images = self.decode_latent(latents)
-
- torch.cuda.synchronize()
- e2e_toc = time.perf_counter()
-
- if not warmup:
- self.print_summary(self.denoising_steps, e2e_tic, e2e_toc, vae_enc=True)
- self.save_image(images, 'img2img', prompt)
diff --git a/demo/Diffusion/inpaint_pipeline.py b/demo/Diffusion/inpaint_pipeline.py
deleted file mode 100755
index 3a1ade5a..00000000
--- a/demo/Diffusion/inpaint_pipeline.py
+++ /dev/null
@@ -1,135 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import numpy as np
-import nvtx
-import time
-import torch
-import tensorrt as trt
-from utilities import prepare_mask_and_masked_image, TRT_LOGGER
-from stable_diffusion_pipeline import StableDiffusionPipeline
-
-class InpaintPipeline(StableDiffusionPipeline):
- """
- Application showcasing the acceleration of Stable Diffusion Inpainting v1.5, v2.0 pipeline using NVidia TensorRT w/ Plugins.
- """
- def __init__(
- self,
- scheduler="PNDM",
- *args, **kwargs
- ):
- """
- Initializes the Inpainting Diffusion pipeline.
-
- Args:
- scheduler (str):
- The scheduler to guide the denoising process. Must be one of the [PNDM].
- """
-
- if scheduler != "PNDM":
- raise ValueError(f"Inpainting only supports PNDM scheduler")
-
- super(InpaintPipeline, self).__init__(*args, **kwargs, \
- inpaint=True, scheduler=scheduler, stages=[ 'vae_encoder', 'clip', 'unet', 'vae'])
-
- def infer(
- self,
- prompt,
- negative_prompt,
- input_image,
- mask_image,
- image_height,
- image_width,
- seed=None,
- strength=0.75,
- warmup = False,
- verbose = False,
- ):
- """
- Run the diffusion pipeline.
-
- Args:
- prompt (str):
- The text prompt to guide image generation.
- negative_prompt (str):
- The prompt not to guide the image generation.
- input_image (image):
- Input image to be inpainted.
- mask_image (image):
- Mask image containg the region to be inpainted.
- image_height (int):
- Height (in pixels) of the image to be generated. Must be a multiple of 8.
- image_width (int):
- Width (in pixels) of the image to be generated. Must be a multiple of 8.
- seed (int):
- Seed for the random generator
- strength (float):
- How much to transform the input image. Must be between 0 and 1
- warmup (bool):
- Indicate if this is a warmup run.
- verbose (bool):
- Enable verbose logging.
- """
- batch_size = len(prompt)
- assert len(prompt) == len(negative_prompt)
-
- # Spatial dimensions of latent tensor
- latent_height = image_height // 8
- latent_width = image_width // 8
-
- with torch.inference_mode(), torch.autocast("cuda"), trt.Runtime(TRT_LOGGER):
- # Pre-initialize latents
- # TODO: unet_channels = 9?
- latents = self.initialize_latents( \
- batch_size=batch_size, \
- unet_channels=4, \
- latent_height=latent_height, \
- latent_width=latent_width
- )
-
- torch.cuda.synchronize()
- e2e_tic = time.perf_counter()
-
- # Pre-process input images
- mask, masked_image = self.preprocess_images(batch_size, prepare_mask_and_masked_image(input_image, mask_image))
- mask = torch.nn.functional.interpolate(mask, size=(latent_height, latent_width))
- mask = torch.cat([mask] * 2)
-
- # Initialize timesteps
- timesteps, t_start = self.initialize_timesteps(self.denoising_steps, strength)
-
- # VAE encode masked image
- masked_latents = self.encode_image(masked_image)
- masked_latents = torch.cat([masked_latents] * 2)
-
- # CLIP text encoder
- text_embeddings = self.encode_prompt(prompt, negative_prompt)
-
- # UNet denoiser
- latents = self.denoise_latent(latents, text_embeddings, timesteps=timesteps, \
- step_offset=t_start, mask=mask, masked_image_latents=masked_latents)
-
- # VAE decode latent
- images = self.decode_latent(latents)
-
- torch.cuda.synchronize()
- e2e_toc = time.perf_counter()
-
- if not warmup:
- self.print_summary(self.denoising_steps, e2e_tic, e2e_toc, vae_enc=True)
- self.save_image(images, 'inpaint', prompt)
-
diff --git a/demo/Diffusion/models.py b/demo/Diffusion/models.py
index bcf69b32..b1a196aa 100644
--- a/demo/Diffusion/models.py
+++ b/demo/Diffusion/models.py
@@ -15,17 +15,31 @@
# limitations under the License.
#
-from collections import OrderedDict
-from copy import deepcopy
-from diffusers.models import AutoencoderKL, UNet2DConditionModel
+from diffusers import DiffusionPipeline
+from diffusers.loaders import LoraLoaderMixin
+from diffusers.models import (
+ AutoencoderKL,
+ ControlNetModel,
+ UNet2DConditionModel
+)
+from diffusers.utils import convert_state_dict_to_diffusers
+import json
import numpy as np
-from onnx import shape_inference
+import onnx
+from onnx import numpy_helper, shape_inference
import onnx_graphsurgeon as gs
+import os
from polygraphy.backend.onnx.loader import fold_constants
+import re
+import tempfile
import torch
-from transformers import CLIPTextModel, CLIPTokenizer
-from cuda import cudart
-import onnx
+import torch.nn.functional as F
+from transformers import (
+ CLIPTextModel,
+ CLIPTextModelWithProjection,
+ CLIPTokenizer
+)
+from utilities import merge_loras
class Optimizer():
def __init__(
@@ -42,8 +56,7 @@ def info(self, prefix):
def cleanup(self, return_onnx=False):
self.graph.cleanup().toposort()
- if return_onnx:
- return gs.export_onnx(self.graph)
+ return gs.export_onnx(self.graph) if return_onnx else self.graph
def select_outputs(self, keep, names=None):
self.graph.outputs = [self.graph.outputs[o] for o in keep]
@@ -60,7 +73,17 @@ def fold_constants(self, return_onnx=False):
def infer_shapes(self, return_onnx=False):
onnx_graph = gs.export_onnx(self.graph)
if onnx_graph.ByteSize() > 2147483648:
- raise TypeError("ERROR: model size exceeds supported 2GB limit")
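+            # Graph exceeds the 2GB protobuf limit: save it with external data and run shape inference through the file-based API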
+ temp_dir = tempfile.TemporaryDirectory().name
+ os.makedirs(temp_dir, exist_ok=True)
+ onnx_orig_path = os.path.join(temp_dir, 'model.onnx')
+ onnx_inferred_path = os.path.join(temp_dir, 'inferred.onnx')
+ onnx.save_model(onnx_graph,
+ onnx_orig_path,
+ save_as_external_data=True,
+ all_tensors_to_one_file=True,
+ convert_attribute=False)
+ onnx.shape_inference.infer_shapes_path(onnx_orig_path, onnx_inferred_path)
+ onnx_graph = onnx.load(onnx_inferred_path)
else:
onnx_graph = shape_inference.infer_shapes(onnx_graph)
@@ -68,24 +91,100 @@ def infer_shapes(self, return_onnx=False):
if return_onnx:
return onnx_graph
-def get_path(version, inpaint=False):
+ def clip_add_hidden_states(self, return_onnx=False):
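+        # Expose the output of the second-to-last CLIP encoder layer as an additional 'hidden_states' graph output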
+ hidden_layers = -1
+ onnx_graph = gs.export_onnx(self.graph)
+ for i in range(len(onnx_graph.graph.node)):
+ for j in range(len(onnx_graph.graph.node[i].output)):
+ name = onnx_graph.graph.node[i].output[j]
+ if "layers" in name:
+ hidden_layers = max(int(name.split(".")[1].split("/")[0]), hidden_layers)
+ for i in range(len(onnx_graph.graph.node)):
+ for j in range(len(onnx_graph.graph.node[i].output)):
+ if onnx_graph.graph.node[i].output[j] == "/text_model/encoder/layers.{}/Add_1_output_0".format(hidden_layers-1):
+ onnx_graph.graph.node[i].output[j] = "hidden_states"
+ for j in range(len(onnx_graph.graph.node[i].input)):
+ if onnx_graph.graph.node[i].input[j] == "/text_model/encoder/layers.{}/Add_1_output_0".format(hidden_layers-1):
+ onnx_graph.graph.node[i].input[j] = "hidden_states"
+ if return_onnx:
+ return onnx_graph
+
+ def fuse_mha_qkv_int8_sq(self):
+ tensors = self.graph.tensors()
+ keys = tensors.keys()
+
+ # mha : fuse QKV QDQ nodes
+ # mhca : fuse KV QDQ nodes
+ q_pat = (
+ "/down_blocks.\\d+/attentions.\\d+/transformer_blocks"
+ ".\\d+/attn\\d+/to_q/input_quantizer/DequantizeLinear_output_0"
+ )
+ k_pat = (
+ "/down_blocks.\\d+/attentions.\\d+/transformer_blocks"
+ ".\\d+/attn\\d+/to_k/input_quantizer/DequantizeLinear_output_0"
+ )
+ v_pat = (
+ "/down_blocks.\\d+/attentions.\\d+/transformer_blocks"
+ ".\\d+/attn\\d+/to_v/input_quantizer/DequantizeLinear_output_0"
+ )
+
+ qs = list(sorted(map(
+ lambda x: x.group(0), # type: ignore
+ filter(lambda x: x is not None, [re.match(q_pat, key) for key in keys]),
+ )))
+ ks = list(sorted(map(
+ lambda x: x.group(0), # type: ignore
+ filter(lambda x: x is not None, [re.match(k_pat, key) for key in keys]),
+ )))
+ vs = list(sorted(map(
+ lambda x: x.group(0), # type: ignore
+ filter(lambda x: x is not None, [re.match(v_pat, key) for key in keys]),
+ )))
+
+ removed = 0
+ assert len(qs) == len(ks) == len(vs), "Failed to collect tensors"
+ for q, k, v in zip(qs, ks, vs):
+ is_mha = all(["attn1" in tensor for tensor in [q, k, v]])
+ is_mhca = all(["attn2" in tensor for tensor in [q, k, v]])
+ assert (is_mha or is_mhca) and (not (is_mha and is_mhca))
+
+ if is_mha:
+ tensors[k].outputs[0].inputs[0] = tensors[q]
+ tensors[v].outputs[0].inputs[0] = tensors[q]
+ del tensors[k]
+ del tensors[v]
+ removed += 2
+ else: # is_mhca
+ tensors[k].outputs[0].inputs[0] = tensors[v]
+ del tensors[k]
+ removed += 1
+ print(f"Removed {removed} QDQ nodes")
+ return removed
+
+
+def get_path(version, pipeline, controlnets=None):
+ if controlnets is not None:
+ return ["lllyasviel/sd-controlnet-" + modality for modality in controlnets]
+
if version == "1.4":
- if inpaint:
+ if pipeline.is_inpaint():
return "runwayml/stable-diffusion-inpainting"
else:
return "CompVis/stable-diffusion-v1-4"
elif version == "1.5":
- if inpaint:
+ if pipeline.is_inpaint():
return "runwayml/stable-diffusion-inpainting"
else:
return "runwayml/stable-diffusion-v1-5"
+ elif version == 'dreamshaper-7':
+ return 'Lykon/dreamshaper-7'
elif version == "2.0-base":
- if inpaint:
+ if pipeline.is_inpaint():
return "stabilityai/stable-diffusion-2-inpainting"
else:
return "stabilityai/stable-diffusion-2-base"
elif version == "2.0":
- if inpaint:
+ if pipeline.is_inpaint():
return "stabilityai/stable-diffusion-2-inpainting"
else:
return "stabilityai/stable-diffusion-2"
@@ -93,35 +192,135 @@ def get_path(version, inpaint=False):
return "stabilityai/stable-diffusion-2-1"
elif version == "2.1-base":
return "stabilityai/stable-diffusion-2-1-base"
+ elif version == 'xl-1.0':
+ if pipeline.is_sd_xl_base():
+ return "stabilityai/stable-diffusion-xl-base-1.0"
+ elif pipeline.is_sd_xl_refiner():
+ return "stabilityai/stable-diffusion-xl-refiner-1.0"
+ else:
+ raise ValueError(f"Unsupported SDXL 1.0 pipeline {pipeline.name}")
+ elif version == 'xl-turbo':
+ if pipeline.is_sd_xl_base():
+ return "stabilityai/sdxl-turbo"
+ else:
+ raise ValueError(f"Unsupported SDXL Turbo pipeline {pipeline.name}")
else:
raise ValueError(f"Incorrect version {version}")
-def get_embedding_dim(version):
- if version in ("1.4", "1.5"):
+def get_clip_embedding_dim(version, pipeline):
+ if version in ("1.4", "1.5", "dreamshaper-7"):
return 768
elif version in ("2.0", "2.0-base", "2.1", "2.1-base"):
return 1024
+ elif version in ("xl-1.0", "xl-turbo") and pipeline.is_sd_xl_base():
+ return 768
else:
- raise ValueError(f"Incorrect version {version}")
+ raise ValueError(f"Invalid version {version} + pipeline {pipeline}")
+
+def get_clipwithproj_embedding_dim(version, pipeline):
+ if version in ("xl-1.0", "xl-turbo"):
+ return 1280
+ else:
+ raise ValueError(f"Invalid version {version} + pipeline {pipeline}")
+
+def get_unet_embedding_dim(version, pipeline):
+ if version in ("1.4", "1.5", "dreamshaper-7"):
+ return 768
+ elif version in ("2.0", "2.0-base", "2.1", "2.1-base"):
+ return 1024
+ elif version in ("xl-1.0", "xl-turbo") and pipeline.is_sd_xl_base():
+ return 2048
+ elif version in ("xl-1.0", "xl-turbo") and pipeline.is_sd_xl_refiner():
+ return 1280
+ else:
+ raise ValueError(f"Invalid version {version} + pipeline {pipeline}")
+
+# FIXME after serialization support for torch.compile is added
+def get_checkpoint_dir(framework_model_dir, version, pipeline, subfolder, torch_inference):
+ return os.path.join(framework_model_dir, version, pipeline, subfolder)
+
+torch_inference_modes = ['default', 'reduce-overhead', 'max-autotune']
+# FIXME update callsites after serialization support for torch.compile is added
+def optimize_checkpoint(model, torch_inference):
+ if not torch_inference or torch_inference == 'eager':
+ return model
+ assert torch_inference in torch_inference_modes
+ return torch.compile(model, mode=torch_inference, dynamic=False, fullgraph=False)
+
+class LoraLoader(LoraLoaderMixin):
+ def __init__(self,
+ paths,
+ ):
+ self.paths = paths
+ self.state_dict = dict()
+ self.network_alphas = dict()
+
+ for path in paths:
+ state_dict, network_alphas = self.lora_state_dict(path)
+ is_correct_format = all("lora" in key for key in state_dict.keys())
+ if not is_correct_format:
+ raise ValueError("Invalid LoRA checkpoint.")
+
+ self.state_dict[path] = state_dict
+ self.network_alphas[path] = network_alphas
+
+ def get_dicts(self,
+ prefix='unet',
+ convert_to_diffusers=False,
+ ):
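+        # Filter each LoRA state dict down to the requested module prefix ('unet' or 'text_encoder') and strip the prefix from the keys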
+ state_dict = dict()
+ network_alphas = dict()
+
+ for path in self.paths:
+ keys = list(self.state_dict[path].keys())
+ if all(key.startswith(('unet', 'text_encoder')) for key in keys):
+ keys = [k for k in keys if k.startswith(prefix)]
+ if keys:
+ print(f"Processing {prefix} LoRA: {path}")
+ state_dict[path] = {k.replace(f"{prefix}.", ""): v for k, v in self.state_dict[path].items() if k in keys}
+
+ network_alphas[path] = None
+ if path in self.network_alphas and self.network_alphas[path] is not None:
+ alpha_keys = [k for k in self.network_alphas[path].keys() if k.startswith(prefix)]
+ network_alphas[path] = {
+ k.replace(f"{prefix}.", ""): v for k, v in self.network_alphas[path].items() if k in alpha_keys
+ }
+
+ else:
+ # Otherwise, we're dealing with the old format.
+ warn_message = "You have saved the LoRA weights using the old format. To convert LoRA weights to the new format, first load them in a dictionary and then create a new dictionary as follows: `new_state_dict = {f'unet.{module_name}': params for module_name, params in old_state_dict.items()}`."
+ print(warn_message)
+
+ return state_dict, network_alphas
+
class BaseModel():
- def __init__(
- self,
- hf_token,
- fp16=False,
+ def __init__(self,
+ version='1.5',
+ pipeline=None,
device='cuda',
+ hf_token='',
verbose=True,
- path="",
+ framework_model_dir='pytorch_model',
+ fp16=False,
+ int8=False,
max_batch_size=16,
- embedding_dim=768,
text_maxlen=77,
+ embedding_dim=768,
):
- self.name = "SD Model"
- self.hf_token = hf_token
- self.fp16 = fp16
+
+ self.name = self.__class__.__name__
+ self.pipeline = pipeline.name
+ self.version = version
+ self.path = get_path(version, pipeline)
self.device = device
+ self.hf_token = hf_token
+ self.hf_safetensor = not (pipeline.is_inpaint() and version in ("1.4", "1.5"))
self.verbose = verbose
- self.path = path
+ self.framework_model_dir = framework_model_dir
+
+ self.fp16 = fp16
+ self.int8 = int8
self.min_batch = 1
self.max_batch = max_batch_size
@@ -130,10 +329,22 @@ def __init__(
self.min_latent_shape = self.min_image_shape // 8
self.max_latent_shape = self.max_image_shape // 8
- self.embedding_dim = embedding_dim
self.text_maxlen = text_maxlen
+ self.embedding_dim = embedding_dim
+ self.extra_output_names = []
+
+ self.lora_dict = None
+
+ def get_pipeline(self):
+ model_opts = {'variant': 'fp16', 'torch_dtype': torch.float16} if self.fp16 else {}
+ return DiffusionPipeline.from_pretrained(
+ self.path,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token,
+ **model_opts,
+ ).to(self.device)
- def get_model(self):
+ def get_model(self, torch_inference=''):
pass
def get_input_names(self):
@@ -145,7 +356,7 @@ def get_output_names(self):
def get_dynamic_axes(self):
return None
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
pass
def get_input_profile(self, batch_size, image_height, image_width, static_batch, static_shape):
@@ -154,7 +365,108 @@ def get_input_profile(self, batch_size, image_height, image_width, static_batch,
def get_shape_dict(self, batch_size, image_height, image_width):
return None
- def optimize(self, onnx_graph):
+ # Helper utility for ONNX export
+ def export_onnx(
+ self,
+ onnx_path,
+ onnx_opt_path,
+ onnx_opset,
+ opt_image_height,
+ opt_image_width,
+ custom_model=None,
+ enable_lora_merge=False,
+ static_shape=False,
+ ):
+ onnx_opt_graph = None
+ # Export optimized ONNX model (if missing)
+ if not os.path.exists(onnx_opt_path):
+ if not os.path.exists(onnx_path):
+ print(f"[I] Exporting ONNX model: {onnx_path}")
+ def export_onnx(model):
+ if enable_lora_merge:
+ model = merge_loras(model, self.lora_dict, self.lora_alphas, self.lora_scales)
+ inputs = self.get_sample_input(1, opt_image_height, opt_image_width, static_shape)
+ torch.onnx.export(model,
+ inputs,
+ onnx_path,
+ export_params=True,
+ opset_version=onnx_opset,
+ do_constant_folding=True,
+ input_names=self.get_input_names(),
+ output_names=self.get_output_names(),
+ dynamic_axes=self.get_dynamic_axes(),
+ )
+ if custom_model:
+ with torch.inference_mode():
+ export_onnx(custom_model)
+ else:
+ with torch.inference_mode(), torch.autocast("cuda"):
+ export_onnx(self.get_model())
+ else:
+ print(f"[I] Found cached ONNX model: {onnx_path}")
+
+ print(f"[I] Optimizing ONNX model: {onnx_opt_path}")
+ onnx_opt_graph = self.optimize(onnx.load(onnx_path))
+ if onnx_opt_graph.ByteSize() > 2147483648:
+ onnx.save_model(
+ onnx_opt_graph,
+ onnx_opt_path,
+ save_as_external_data=True,
+ all_tensors_to_one_file=True,
+ convert_attribute=False)
+ else:
+ onnx.save(onnx_opt_graph, onnx_opt_path)
+ else:
+ print(f"[I] Found cached optimized ONNX model: {onnx_opt_path} ")
+
+ # Helper utility for weights map
+ def export_weights_map(self, onnx_opt_path, weights_map_path):
+ if not os.path.exists(weights_map_path):
+ onnx_opt_dir = os.path.dirname(onnx_opt_path)
+ onnx_opt_model = onnx.load(onnx_opt_path)
+ state_dict = self.get_model().state_dict()
+ # Create initializer data hashes
+ initializer_hash_mapping = {}
+ for initializer in onnx_opt_model.graph.initializer:
+ initializer_data = numpy_helper.to_array(initializer, base_dir=onnx_opt_dir).astype(np.float16)
+ initializer_hash = hash(initializer_data.data.tobytes())
+ initializer_hash_mapping[initializer.name] = (initializer_hash, initializer_data.shape)
+
+ weights_name_mapping = {}
+ weights_shape_mapping = {}
+ # set to keep track of initializers already added to the name_mapping dict
+ initializers_mapped = set()
+ for wt_name, wt in state_dict.items():
+ # get weight hash
+ wt = wt.cpu().detach().numpy().astype(np.float16)
+ wt_hash = hash(wt.data.tobytes())
+ wt_t_hash = hash(np.transpose(wt).data.tobytes())
+
+ for initializer_name, (initializer_hash, initializer_shape) in initializer_hash_mapping.items():
+ # Due to constant folding, some weights are transposed during export
+ # To account for the transpose op, we compare the initializer hash to the
+ # hash for the weight and its transpose
+ if wt_hash == initializer_hash or wt_t_hash == initializer_hash:
+ # The assert below ensures there is a 1:1 mapping between
+ # PyTorch and ONNX weight names. It can be removed in cases where 1:many
+ # mapping is found and name_mapping[wt_name] = list()
+ assert initializer_name not in initializers_mapped
+ weights_name_mapping[wt_name] = initializer_name
+ initializers_mapped.add(initializer_name)
+                    is_transpose = wt_hash != initializer_hash
+ weights_shape_mapping[wt_name] = (initializer_shape, is_transpose)
+
+ # Sanity check: Were any weights not matched
+ if wt_name not in weights_name_mapping:
+ print(f'[I] PyTorch weight {wt_name} not matched with any ONNX initializer')
+ print(f'[I] {len(weights_name_mapping.keys())} PyTorch weights were matched with ONNX initializers')
+ assert weights_name_mapping.keys() == weights_shape_mapping.keys()
+ with open(weights_map_path, 'w') as fp:
+ json.dump([weights_name_mapping, weights_shape_mapping], fp)
+ else:
+ print(f"[I] Found cached weights map: {weights_map_path} ")
+
+ def optimize(self, onnx_graph, return_onnx=True, **kwargs):
opt = Optimizer(onnx_graph, verbose=self.verbose)
opt.info(self.name + ': original')
opt.cleanup()
@@ -163,7 +475,10 @@ def optimize(self, onnx_graph):
opt.info(self.name + ': fold constants')
opt.infer_shapes()
opt.info(self.name + ': shape inference')
- onnx_opt_graph = opt.cleanup(return_onnx=True)
+ if kwargs.get('fuse_mha_qkv_int8', False):
+ opt.fuse_mha_qkv_int8_sq()
+ opt.info(self.name + ': fuse QKV nodes')
+ onnx_opt_graph = opt.cleanup(return_onnx=return_onnx)
opt.info(self.name + ': finished')
return onnx_opt_graph
@@ -191,28 +506,49 @@ def get_minmax_dims(self, batch_size, image_height, image_width, static_batch, s
max_latent_width = latent_width if static_shape else self.max_latent_shape
return (min_batch, max_batch, min_image_height, max_image_height, min_image_width, max_image_width, min_latent_height, max_latent_height, min_latent_width, max_latent_width)
-class CLIP(BaseModel):
+
+class CLIPModel(BaseModel):
def __init__(self,
- hf_token,
+ version,
+ pipeline,
device,
+ hf_token,
verbose,
- path,
+ framework_model_dir,
max_batch_size,
- embedding_dim
+ embedding_dim,
+ fp16=False,
+ output_hidden_states=False,
+ subfolder="text_encoder",
+ lora_dict=None,
+ lora_alphas=None,
):
- super(CLIP, self).__init__(hf_token, device=device, verbose=verbose, path=path, max_batch_size=max_batch_size, embedding_dim=embedding_dim)
- self.name = "CLIP"
+ super(CLIPModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size, embedding_dim=embedding_dim)
+ self.subfolder = subfolder
- def get_model(self):
- return CLIPTextModel.from_pretrained(self.path,
- subfolder="text_encoder",
- use_auth_token=self.hf_token).to(self.device)
+ # Output the final hidden state
+ if output_hidden_states:
+ self.extra_output_names = ['hidden_states']
+
+ def get_model(self, torch_inference=''):
+ clip_model_dir = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(clip_model_dir):
+ model = CLIPTextModel.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token).to(self.device)
+ model.save_pretrained(clip_model_dir)
+ else:
+ print(f"[I] Load CLIP pytorch model from: {clip_model_dir}")
+ model = CLIPTextModel.from_pretrained(clip_model_dir).to(self.device)
+ model = optimize_checkpoint(model, torch_inference)
+ return model
def get_input_names(self):
return ['input_ids']
def get_output_names(self):
- return ['text_embeddings', 'pooler_output']
+ return ['text_embeddings']
def get_dynamic_axes(self):
return {
@@ -229,12 +565,15 @@ def get_input_profile(self, batch_size, image_height, image_width, static_batch,
def get_shape_dict(self, batch_size, image_height, image_width):
self.check_dims(batch_size, image_height, image_width)
- return {
+ output = {
'input_ids': (batch_size, self.text_maxlen),
'text_embeddings': (batch_size, self.text_maxlen, self.embedding_dim)
}
+ if 'hidden_states' in self.extra_output_names:
+ output["hidden_states"] = (batch_size, self.text_maxlen, self.embedding_dim)
+ return output
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
self.check_dims(batch_size, image_height, image_width)
return torch.zeros(batch_size, self.text_maxlen, dtype=torch.int32, device=self.device)
@@ -251,97 +590,389 @@ def optimize(self, onnx_graph):
opt.select_outputs([0], names=['text_embeddings']) # rename network output
opt.info(self.name + ': remove output[0]')
opt_onnx_graph = opt.cleanup(return_onnx=True)
+ if 'hidden_states' in self.extra_output_names:
+ opt_onnx_graph = opt.clip_add_hidden_states(return_onnx=True)
+ opt.info(self.name + ': added hidden_states')
opt.info(self.name + ': finished')
return opt_onnx_graph
-def make_CLIP(version, hf_token, device, verbose, max_batch_size, inpaint=False):
- return CLIP(hf_token=hf_token, device=device, verbose=verbose, path=get_path(version, inpaint=inpaint),
- max_batch_size=max_batch_size, embedding_dim=get_embedding_dim(version))
-class UNet(BaseModel):
+class CLIPWithProjModel(CLIPModel):
def __init__(self,
+ version,
+ pipeline,
+ device,
hf_token,
+ verbose,
+ framework_model_dir,
fp16=False,
- device='cuda',
- verbose=True,
- path="",
max_batch_size=16,
- embedding_dim=768,
- text_maxlen=77,
- unet_dim=4
+ output_hidden_states=False,
+ subfolder="text_encoder_2",
+ lora_dict=None,
+ lora_alphas=None,
):
- super(UNet, self).__init__(hf_token, fp16=fp16, device=device, verbose=verbose, path=path, max_batch_size=max_batch_size, embedding_dim=embedding_dim, text_maxlen=text_maxlen)
- self.unet_dim = unet_dim
- self.name = "UNet"
-
- def get_model(self):
- model_opts = {'revision': 'fp16', 'torch_dtype': torch.float16} if self.fp16 else {}
- return UNet2DConditionModel.from_pretrained(self.path,
- subfolder="unet",
- use_auth_token=self.hf_token,
- **model_opts).to(self.device)
+
+ super(CLIPWithProjModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size, embedding_dim=get_clipwithproj_embedding_dim(version, pipeline), output_hidden_states=output_hidden_states)
+ self.subfolder = subfolder
+
+ def get_model(self, torch_inference=''):
+ clip_model_dir = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(clip_model_dir):
+ model = CLIPTextModelWithProjection.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token).to(self.device)
+ model.save_pretrained(clip_model_dir)
+ else:
+ print(f"[I] Load CLIP pytorch model from: {clip_model_dir}")
+ model = CLIPTextModelWithProjection.from_pretrained(clip_model_dir).to(self.device)
+ model = optimize_checkpoint(model, torch_inference)
+ return model
+
+ def get_shape_dict(self, batch_size, image_height, image_width):
+ self.check_dims(batch_size, image_height, image_width)
+ output = {
+ 'input_ids': (batch_size, self.text_maxlen),
+ 'text_embeddings': (batch_size, self.embedding_dim)
+ }
+ if 'hidden_states' in self.extra_output_names:
+ output["hidden_states"] = (batch_size, self.text_maxlen, self.embedding_dim)
+
+ return output
+
+
+class UNet2DConditionControlNetModel(torch.nn.Module):
+ def __init__(self, unet, controlnets) -> None:
+ super().__init__()
+ self.unet = unet
+ self.controlnets = controlnets
+
+ def forward(self, sample, timestep, encoder_hidden_states, images, controlnet_scales):
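+        # Run every ControlNet, scale its residuals by the matching conditioning strength, accumulate them, and pass the sums to the UNet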
+ for i, (image, conditioning_scale, controlnet) in enumerate(zip(images, controlnet_scales, self.controlnets)):
+ down_samples, mid_sample = controlnet(
+ sample,
+ timestep,
+ encoder_hidden_states=encoder_hidden_states,
+ controlnet_cond=image,
+ return_dict=False,
+ )
+
+ down_samples = [
+ down_sample * conditioning_scale
+ for down_sample in down_samples
+ ]
+ mid_sample *= conditioning_scale
+
+ # merge samples
+ if i == 0:
+ down_block_res_samples, mid_block_res_sample = down_samples, mid_sample
+ else:
+ down_block_res_samples = [
+ samples_prev + samples_curr
+ for samples_prev, samples_curr in zip(down_block_res_samples, down_samples)
+ ]
+ mid_block_res_sample += mid_sample
+
+ noise_pred = self.unet(
+ sample,
+ timestep,
+ encoder_hidden_states=encoder_hidden_states,
+ down_block_additional_residuals=down_block_res_samples,
+ mid_block_additional_residual=mid_block_res_sample
+ )
+ return noise_pred
+
+
+class UNetModel(BaseModel):
+ def __init__(self,
+ version,
+ pipeline,
+ device,
+ hf_token,
+ verbose,
+ framework_model_dir,
+ fp16 = False,
+ int8 = False,
+ max_batch_size = 16,
+ text_maxlen = 77,
+ controlnets = None,
+ lora_scales = None,
+ lora_dict = None,
+ lora_alphas = None,
+ do_classifier_free_guidance = False,
+ ):
+
+ super(UNetModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size, text_maxlen=text_maxlen, embedding_dim=get_unet_embedding_dim(version, pipeline))
+ self.subfolder = 'unet'
+ self.controlnets = get_path(version, pipeline, controlnets) if controlnets else None
+ self.unet_dim = (9 if pipeline.is_inpaint() else 4)
+ self.lora_scales = lora_scales
+ self.lora_dict = lora_dict
+ self.lora_alphas = lora_alphas
+ self.xB = 2 if do_classifier_free_guidance else 1 # batch multiplier
+
+ def get_model(self, torch_inference=''):
+ model_opts = {'variant': 'fp16', 'torch_dtype': torch.float16} if self.fp16 else {}
+ if self.controlnets:
+ unet_model = UNet2DConditionModel.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token,
+ **model_opts).to(self.device)
+ cnet_model_opts = {'torch_dtype': torch.float16} if self.fp16 else {}
+ controlnets = torch.nn.ModuleList([ControlNetModel.from_pretrained(path, **cnet_model_opts).to(self.device) for path in self.controlnets])
+ # FIXME - cache UNet2DConditionControlNetModel
+ model = UNet2DConditionControlNetModel(unet_model, controlnets)
+ else:
+ unet_model_dir = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(unet_model_dir):
+ model = UNet2DConditionModel.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token,
+ **model_opts).to(self.device)
+ model.save_pretrained(unet_model_dir)
+ else:
+ print(f"[I] Load UNet pytorch model from: {unet_model_dir}")
+ model = UNet2DConditionModel.from_pretrained(unet_model_dir).to(self.device)
+ if torch_inference:
+ model.to(memory_format=torch.channels_last)
+ model = optimize_checkpoint(model, torch_inference)
+ return model
def get_input_names(self):
- return ['sample', 'timestep', 'encoder_hidden_states']
+ if self.controlnets is None:
+ return ['sample', 'timestep', 'encoder_hidden_states']
+ else:
+ return ['sample', 'timestep', 'encoder_hidden_states', 'images', 'controlnet_scales']
def get_output_names(self):
return ['latent']
def get_dynamic_axes(self):
+ xB = '2B' if self.xB == 2 else 'B'
+ if self.controlnets is None:
+ return {
+ 'sample': {0: xB, 2: 'H', 3: 'W'},
+ 'encoder_hidden_states': {0: xB},
+ 'latent': {0: xB, 2: 'H', 3: 'W'}
+ }
+ else:
+ return {
+ 'sample': {0: xB, 2: 'H', 3: 'W'},
+ 'encoder_hidden_states': {0: xB},
+ 'images': {1: xB, 3: '8H', 4: '8W'},
+ 'latent': {0: xB, 2: 'H', 3: 'W'}
+ }
+
+ def get_input_profile(self, batch_size, image_height, image_width, static_batch, static_shape):
+ # WAR to enable inference for H/W that are not multiples of 16
+ # If building with Dynamic Shapes: ensure image height and width are not multiples of 16 for ONNX export and TensorRT engine build
+ if not static_shape:
+ image_height = image_height - 8 if image_height % 16 == 0 else image_height
+ image_width = image_width - 8 if image_width % 16 == 0 else image_width
+ latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
+ min_batch, max_batch, min_image_height, max_image_height, min_image_width, max_image_width, min_latent_height, max_latent_height, min_latent_width, max_latent_width = \
+ self.get_minmax_dims(batch_size, image_height, image_width, static_batch, static_shape)
+ if self.controlnets is None:
+ return {
+ 'sample': [(self.xB*min_batch, self.unet_dim, min_latent_height, min_latent_width), (self.xB*batch_size, self.unet_dim, latent_height, latent_width), (self.xB*max_batch, self.unet_dim, max_latent_height, max_latent_width)],
+ 'encoder_hidden_states': [(self.xB*min_batch, self.text_maxlen, self.embedding_dim), (self.xB*batch_size, self.text_maxlen, self.embedding_dim), (self.xB*max_batch, self.text_maxlen, self.embedding_dim)]
+ }
+ else:
+ return {
+ 'sample': [(self.xB*min_batch, self.unet_dim, min_latent_height, min_latent_width),
+ (self.xB*batch_size, self.unet_dim, latent_height, latent_width),
+ (self.xB*max_batch, self.unet_dim, max_latent_height, max_latent_width)],
+ 'encoder_hidden_states': [(self.xB*min_batch, self.text_maxlen, self.embedding_dim),
+ (self.xB*batch_size, self.text_maxlen, self.embedding_dim),
+ (self.xB*max_batch, self.text_maxlen, self.embedding_dim)],
+ 'images': [(len(self.controlnets), self.xB*min_batch, 3, min_image_height, min_image_width),
+ (len(self.controlnets), self.xB*batch_size, 3, image_height, image_width),
+ (len(self.controlnets), self.xB*max_batch, 3, max_image_height, max_image_width)]
+ }
+
+
+ def get_shape_dict(self, batch_size, image_height, image_width):
+ latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
+ if self.controlnets is None:
+ return {
+ 'sample': (self.xB*batch_size, self.unet_dim, latent_height, latent_width),
+ 'encoder_hidden_states': (self.xB*batch_size, self.text_maxlen, self.embedding_dim),
+ 'latent': (self.xB*batch_size, 4, latent_height, latent_width)
+ }
+ else:
+ return {
+ 'sample': (self.xB*batch_size, self.unet_dim, latent_height, latent_width),
+ 'encoder_hidden_states': (self.xB*batch_size, self.text_maxlen, self.embedding_dim),
+ 'images': (len(self.controlnets), self.xB*batch_size, 3, image_height, image_width),
+ 'latent': (self.xB*batch_size, 4, latent_height, latent_width)
+ }
+
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
+ # WAR to enable inference for H/W that are not multiples of 16
+ # If building with Dynamic Shapes: ensure image height and width are not multiples of 16 for ONNX export and TensorRT engine build
+ if not static_shape:
+ image_height = image_height - 8 if image_height % 16 == 0 else image_height
+ image_width = image_width - 8 if image_width % 16 == 0 else image_width
+ latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
+ dtype = torch.float16 if self.fp16 else torch.float32
+ if self.controlnets is None:
+ return (
+ torch.randn(batch_size, self.unet_dim, latent_height, latent_width, dtype=torch.float32, device=self.device),
+ torch.tensor([1.], dtype=torch.float32, device=self.device),
+ torch.randn(batch_size, self.text_maxlen, self.embedding_dim, dtype=dtype, device=self.device)
+ )
+ else:
+ return (
+ torch.randn(batch_size, self.unet_dim, latent_height, latent_width, dtype=torch.float32, device=self.device),
+ torch.tensor(999, dtype=torch.float32, device=self.device),
+ torch.randn(batch_size, self.text_maxlen, self.embedding_dim, dtype=dtype, device=self.device),
+ torch.randn(len(self.controlnets), batch_size, 3, image_height, image_width, dtype=dtype, device=self.device),
+ torch.randn(len(self.controlnets), dtype=dtype, device=self.device)
+ )
+
+
+class UNetXLModel(BaseModel):
+ def __init__(self,
+ version,
+ pipeline,
+ device,
+ hf_token,
+ verbose,
+ framework_model_dir,
+ fp16 = False,
+ int8 = False,
+ max_batch_size = 16,
+ text_maxlen = 77,
+ lora_scales = None,
+ lora_dict = None,
+ lora_alphas = None,
+ do_classifier_free_guidance = False,
+ ):
+ super(UNetXLModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size, text_maxlen=text_maxlen, embedding_dim=get_unet_embedding_dim(version, pipeline))
+ self.subfolder = 'unet'
+ self.unet_dim = (9 if pipeline.is_inpaint() else 4)
+ self.time_dim = (5 if pipeline.is_sd_xl_refiner() else 6)
+ self.lora_scales = lora_scales
+ self.lora_dict = lora_dict
+ self.lora_alphas = lora_alphas
+ self.xB = 2 if do_classifier_free_guidance else 1 # batch multiplier
+
+ def get_model(self, torch_inference=''):
+ model_opts = {'variant': 'fp16', 'torch_dtype': torch.float16} if self.fp16 else {}
+ unet_model_dir = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(unet_model_dir):
+ model = UNet2DConditionModel.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token,
+ **model_opts).to(self.device)
+ # Use default attention processor for ONNX export
+ if not torch_inference:
+ model.set_default_attn_processor()
+ model.save_pretrained(unet_model_dir)
+ else:
+ print(f"[I] Load UNet pytorch model from: {unet_model_dir}")
+ model_load_opts = {'torch_dtype': torch.float16} if self.fp16 else {}
+ model = UNet2DConditionModel.from_pretrained(unet_model_dir, **model_load_opts).to(self.device)
+ model = optimize_checkpoint(model, torch_inference)
+ return model
+
+ def get_input_names(self):
+ return ['sample', 'timestep', 'encoder_hidden_states', 'text_embeds', 'time_ids']
+
+ def get_output_names(self):
+ return ['latent']
+
+ def get_dynamic_axes(self):
+ xB = '2B' if self.xB == 2 else 'B'
return {
- 'sample': {0: '2B', 2: 'H', 3: 'W'},
- 'encoder_hidden_states': {0: '2B'},
- 'latent': {0: '2B', 2: 'H', 3: 'W'}
+ 'sample': {0: xB, 2: 'H', 3: 'W'},
+ 'encoder_hidden_states': {0: xB},
+ 'latent': {0: xB, 2: 'H', 3: 'W'},
+ 'text_embeds': {0: xB},
+ 'time_ids': {0: xB}
}
def get_input_profile(self, batch_size, image_height, image_width, static_batch, static_shape):
+ # WAR to enable inference for H/W that are not multiples of 16
+ # If building with Dynamic Shapes: ensure image height and width are not multiples of 16 for ONNX export and TensorRT engine build
+ if not static_shape:
+ image_height = image_height - 8 if image_height % 16 == 0 else image_height
+ image_width = image_width - 8 if image_width % 16 == 0 else image_width
latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
min_batch, max_batch, _, _, _, _, min_latent_height, max_latent_height, min_latent_width, max_latent_width = \
self.get_minmax_dims(batch_size, image_height, image_width, static_batch, static_shape)
return {
- 'sample': [(2*min_batch, self.unet_dim, min_latent_height, min_latent_width), (2*batch_size, self.unet_dim, latent_height, latent_width), (2*max_batch, self.unet_dim, max_latent_height, max_latent_width)],
- 'encoder_hidden_states': [(2*min_batch, self.text_maxlen, self.embedding_dim), (2*batch_size, self.text_maxlen, self.embedding_dim), (2*max_batch, self.text_maxlen, self.embedding_dim)]
+ 'sample': [(self.xB*min_batch, self.unet_dim, min_latent_height, min_latent_width), (self.xB*batch_size, self.unet_dim, latent_height, latent_width), (self.xB*max_batch, self.unet_dim, max_latent_height, max_latent_width)],
+ 'encoder_hidden_states': [(self.xB*min_batch, self.text_maxlen, self.embedding_dim), (self.xB*batch_size, self.text_maxlen, self.embedding_dim), (self.xB*max_batch, self.text_maxlen, self.embedding_dim)],
+ 'text_embeds': [(self.xB*min_batch, 1280), (self.xB*batch_size, 1280), (self.xB*max_batch, 1280)],
+ 'time_ids': [(self.xB*min_batch, self.time_dim), (self.xB*batch_size, self.time_dim), (self.xB*max_batch, self.time_dim)]
}
def get_shape_dict(self, batch_size, image_height, image_width):
latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
return {
- 'sample': (2*batch_size, self.unet_dim, latent_height, latent_width),
- 'encoder_hidden_states': (2*batch_size, self.text_maxlen, self.embedding_dim),
- 'latent': (2*batch_size, 4, latent_height, latent_width)
+ 'sample': (self.xB*batch_size, self.unet_dim, latent_height, latent_width),
+ 'encoder_hidden_states': (self.xB*batch_size, self.text_maxlen, self.embedding_dim),
+ 'latent': (self.xB*batch_size, 4, latent_height, latent_width),
+ 'text_embeds': (self.xB*batch_size, 1280),
+ 'time_ids': (self.xB*batch_size, self.time_dim)
}
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
+ # WAR to enable inference for H/W that are not multiples of 16
+ # If building with Dynamic Shapes: ensure image height and width are not multiples of 16 for ONNX export and TensorRT engine build
+ if not static_shape:
+ image_height = image_height - 8 if image_height % 16 == 0 else image_height
+ image_width = image_width - 8 if image_width % 16 == 0 else image_width
latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
dtype = torch.float16 if self.fp16 else torch.float32
return (
- torch.randn(2*batch_size, self.unet_dim, latent_height, latent_width, dtype=torch.float32, device=self.device),
+ torch.randn(self.xB*batch_size, self.unet_dim, latent_height, latent_width, dtype=torch.float32, device=self.device),
torch.tensor([1.], dtype=torch.float32, device=self.device),
- torch.randn(2*batch_size, self.text_maxlen, self.embedding_dim, dtype=dtype, device=self.device)
+ torch.randn(self.xB*batch_size, self.text_maxlen, self.embedding_dim, dtype=dtype, device=self.device),
+ {
+ 'added_cond_kwargs': {
+ 'text_embeds': torch.randn(self.xB*batch_size, 1280, dtype=dtype, device=self.device),
+ 'time_ids' : torch.randn(self.xB*batch_size, self.time_dim, dtype=dtype, device=self.device)
+ }
+ }
)
-def make_UNet(version, hf_token, device, verbose, max_batch_size, inpaint=False):
- return UNet(hf_token=hf_token, fp16=True, device=device, verbose=verbose, path=get_path(version, inpaint=inpaint),
- max_batch_size=max_batch_size, embedding_dim=get_embedding_dim(version), unet_dim=(9 if inpaint else 4))
+ def optimize(self, onnx_graph):
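+        # Request Q/K/V quantizer fusion on top of the base optimizations; the fusion only matches int8 QDQ tensors, so non-quantized graphs pass through unchanged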
+ return super().optimize(onnx_graph, fuse_mha_qkv_int8=True)
-class VAE(BaseModel):
+class VAEModel(BaseModel):
def __init__(self,
- hf_token,
+ version,
+ pipeline,
device,
+ hf_token,
verbose,
- path,
- max_batch_size,
- embedding_dim
+ framework_model_dir,
+ fp16=False,
+ max_batch_size=16,
):
- super(VAE, self).__init__(hf_token, device=device, verbose=verbose, path=path, max_batch_size=max_batch_size, embedding_dim=embedding_dim)
- self.name = "VAE decoder"
+ super(VAEModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size)
+ self.subfolder = 'vae'
- def get_model(self):
- vae = AutoencoderKL.from_pretrained(self.path,
- subfolder="vae",
- use_auth_token=self.hf_token).to(self.device)
- vae.forward = vae.decode
- return vae
+ def get_model(self, torch_inference=''):
+ vae_decoder_model_path = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(vae_decoder_model_path):
+ model = AutoencoderKL.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token).to(self.device)
+ model.save_pretrained(vae_decoder_model_path)
+ else:
+ print(f"[I] Load VAE decoder pytorch model from: {vae_decoder_model_path}")
+ model = AutoencoderKL.from_pretrained(vae_decoder_model_path).to(self.device)
+ model.forward = model.decode
+ model = optimize_checkpoint(model, torch_inference)
+ return model
def get_input_names(self):
return ['latent']
@@ -370,37 +1001,44 @@ def get_shape_dict(self, batch_size, image_height, image_width):
'images': (batch_size, 3, image_height, image_width)
}
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
return torch.randn(batch_size, 4, latent_height, latent_width, dtype=torch.float32, device=self.device)
-def make_VAE(version, hf_token, device, verbose, max_batch_size, inpaint=False):
- return VAE(hf_token=hf_token, device=device, verbose=verbose, path=get_path(version, inpaint=inpaint),
- max_batch_size=max_batch_size, embedding_dim=get_embedding_dim(version))
class TorchVAEEncoder(torch.nn.Module):
- def __init__(self, token, device, path):
+ def __init__(self, version, pipeline, hf_token, device, path, framework_model_dir, hf_safetensor=False):
super().__init__()
- self.path = path
- self.vae_encoder = AutoencoderKL.from_pretrained(self.path, subfolder="vae", use_auth_token=token).to(device)
-
+ vae_encoder_model_dir = get_checkpoint_dir(framework_model_dir, version, pipeline, 'vae_encoder', '')
+ if not os.path.exists(vae_encoder_model_dir):
+ self.vae_encoder = AutoencoderKL.from_pretrained(path,
+ subfolder='vae',
+ use_safetensors=hf_safetensor,
+ use_auth_token=hf_token).to(device)
+ self.vae_encoder.save_pretrained(vae_encoder_model_dir)
+ else:
+ print(f"[I] Load VAE encoder pytorch model from: {vae_encoder_model_dir}")
+ self.vae_encoder = AutoencoderKL.from_pretrained(vae_encoder_model_dir).to(device)
+
def forward(self, x):
return self.vae_encoder.encode(x).latent_dist.sample()
-class VAEEncoder(BaseModel):
+
+class VAEEncoderModel(BaseModel):
def __init__(self,
- hf_token,
+ version,
+ pipeline,
device,
+ hf_token,
verbose,
- path,
- max_batch_size,
- embedding_dim
+ framework_model_dir,
+ fp16=False,
+ max_batch_size=16,
):
- super(VAEEncoder, self).__init__(hf_token, device=device, verbose=verbose, path=path, max_batch_size=max_batch_size, embedding_dim=embedding_dim)
- self.name = "VAE encoder"
+ super(VAEEncoderModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size)
- def get_model(self):
- vae_encoder = TorchVAEEncoder(self.hf_token, self.device, self.path)
+ def get_model(self, torch_inference=''):
+ vae_encoder = TorchVAEEncoder(self.version, self.pipeline, self.hf_token, self.device, self.path, self.framework_model_dir, hf_safetensor=self.hf_safetensor)
return vae_encoder
def get_input_names(self):
@@ -434,15 +1072,20 @@ def get_shape_dict(self, batch_size, image_height, image_width):
'latent': (batch_size, 4, latent_height, latent_width)
}
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
self.check_dims(batch_size, image_height, image_width)
return torch.randn(batch_size, 3, image_height, image_width, dtype=torch.float32, device=self.device)
-def make_VAEEncoder(version, hf_token, device, verbose, max_batch_size, inpaint=False):
- return VAEEncoder(hf_token=hf_token, device=device, verbose=verbose, path=get_path(version, inpaint=inpaint),
- max_batch_size=max_batch_size, embedding_dim=get_embedding_dim(version))
-def make_tokenizer(version, hf_token):
- return CLIPTokenizer.from_pretrained(get_path(version),
- subfolder="tokenizer",
- use_auth_token=hf_token)
+def make_tokenizer(version, pipeline, hf_token, framework_model_dir, subfolder="tokenizer", **kwargs):
+ tokenizer_model_dir = get_checkpoint_dir(framework_model_dir, version, pipeline.name, subfolder, '')
+ if not os.path.exists(tokenizer_model_dir):
+ model = CLIPTokenizer.from_pretrained(get_path(version, pipeline),
+ subfolder=subfolder,
+ use_safetensors=pipeline.is_sd_xl(),
+ use_auth_token=hf_token)
+ model.save_pretrained(tokenizer_model_dir)
+ else:
+ print(f"[I] Load tokenizer pytorch model from: {tokenizer_model_dir}")
+ model = CLIPTokenizer.from_pretrained(tokenizer_model_dir)
+ return model
diff --git a/demo/Diffusion/requirements.txt b/demo/Diffusion/requirements.txt
index df5a12a6..4de26381 100644
--- a/demo/Diffusion/requirements.txt
+++ b/demo/Diffusion/requirements.txt
@@ -1,15 +1,17 @@
accelerate
colored
+controlnet_aux==0.0.6
cuda-python
-diffusers==0.14.0
+diffusers==0.26.3
ftfy
matplotlib
nvtx
-onnx==1.13.1
-onnxruntime==1.14.1
---extra-index-url https://pypi.ngc.nvidia.com
-onnx-graphsurgeon==0.3.26
-polygraphy==0.47.1
+onnx==1.15.0
+onnxruntime==1.17.0
+opencv-python==4.8.0.74
scipy
-torch<2.0.0
-transformers==4.26.1
+transformers==4.31.0
+--extra-index-url https://pypi.nvidia.com
+nvidia-ammo==0.7.0
+onnx-graphsurgeon
+polygraphy
diff --git a/demo/Diffusion/stable_diffusion_pipeline.py b/demo/Diffusion/stable_diffusion_pipeline.py
index 7632995a..13bd4156 100755
--- a/demo/Diffusion/stable_diffusion_pipeline.py
+++ b/demo/Diffusion/stable_diffusion_pipeline.py
@@ -15,30 +15,69 @@
# limitations under the License.
#
+import ammo.torch.quantization as atq
+import calibration
from cuda import cudart
-import gc
-from models import make_CLIP, make_tokenizer, make_UNet, make_VAE, make_VAEEncoder
+from diffusers import (
+ DDIMScheduler,
+ DDPMScheduler,
+ EulerDiscreteScheduler,
+ EulerAncestralDiscreteScheduler,
+ LCMScheduler, LMSDiscreteScheduler,
+ PNDMScheduler,
+ UniPCMultistepScheduler,
+)
+from hashlib import md5
+import inspect
+from models import (
+ get_clip_embedding_dim,
+ get_path,
+ LoraLoader,
+ make_tokenizer,
+ CLIPModel,
+ CLIPWithProjModel,
+ UNetModel,
+ UNetXLModel,
+ VAEModel,
+ VAEEncoderModel,
+)
import numpy as np
import nvtx
-import os
+import json
import onnx
-from polygraphy import cuda
+import os
+import pathlib
+import tensorrt as trt
+import time
import torch
-from utilities import Engine, save_image
-from utilities import DPMScheduler, DDIMScheduler, EulerAncestralDiscreteScheduler, LMSDiscreteScheduler, PNDMScheduler
+from typing import Optional, List
+from utilities import (
+ PIPELINE_TYPE,
+ TRT_LOGGER,
+ Engine,
+ filter_func,
+ get_smoothquant_config,
+ get_refit_weights,
+ load_calib_prompts,
+ merge_loras,
+ prepare_mask_and_masked_image,
+ quantize_lvl,
+ replace_lora_layers,
+ save_image,
+ unload_model
+)
class StableDiffusionPipeline:
"""
- Application showcasing the acceleration of Stable Diffusion Txt2Img v1.4, v1.5, v2.0-base, v2.0, v2.1, v2.1-base pipeline using NVidia TensorRT w/ Plugins.
+ Application showcasing the acceleration of Stable Diffusion pipelines using NVIDIA TensorRT.
"""
def __init__(
self,
- version="2.1",
- inpaint=False,
- stages=['clip','unet','vae'],
+ version='1.5',
+ pipeline_type=PIPELINE_TYPE.TXT2IMG,
max_batch_size=16,
denoising_steps=50,
- scheduler="DDIM",
+ scheduler=None,
guidance_scale=7.5,
device='cuda',
output_dir='.',
@@ -46,6 +85,13 @@ def __init__(
verbose=False,
nvtx_profile=False,
use_cuda_graph=False,
+ vae_scaling_factor=0.18215,
+ framework_model_dir='pytorch_model',
+ controlnets=None,
+ lora_scale: Optional[List[float]] = None,
+ lora_path: Optional[List[str]] = None,
+ return_latents=False,
+ torch_inference='',
):
"""
Initializes the Diffusion pipeline.
@@ -53,17 +99,15 @@ def __init__(
Args:
version (str):
The version of the pipeline. Should be one of [1.4, 1.5, 2.0, 2.0-base, 2.1, 2.1-base]
- inpaint (bool):
- True if inpainting pipeline.
- stages (list):
- Ordered sequence of stages. Options: ['vae_encoder', 'clip','unet','vae']
+ pipeline_type (PIPELINE_TYPE):
+ Type of current pipeline.
max_batch_size (int):
Maximum batch size for dynamic batch engine.
denoising_steps (int):
The number of denoising steps.
More denoising steps usually lead to a higher quality image at the expense of slower inference.
scheduler (str):
- The scheduler to guide the denoising process. Must be one of [DDIM, DPM, EulerA, LMSD, PNDM].
+ The scheduler to guide the denoising process. Must be one of [DDIM, DDPM, EulerA, Euler, LCM, LMSD, PNDM, UniPC].
guidance_scale (float):
Guidance scale is enabled by setting as > 1.
Higher guidance scale encourages to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
@@ -79,126 +123,202 @@ def __init__(
Insert NVTX profiling markers.
use_cuda_graph (bool):
Use CUDA graph to capture engine execution and then launch inference
+ vae_scaling_factor (float):
+ VAE scaling factor
+ framework_model_dir (str):
+ cache directory for framework checkpoints
+ controlnets (str):
+ Which ControlNet/ControlNets to use.
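+ lora_scale (list):
+ Scale applied to each LoRA adapter, matched by index to lora_path.
+ lora_path (list):
+ Paths of the LoRA adapters to load.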
+ return_latents (bool):
+ Skip decoding the image and return latents instead.
+ torch_inference (str):
+ Run inference with PyTorch (using specified compilation mode) instead of TensorRT.
"""
self.denoising_steps = denoising_steps
- assert guidance_scale > 1.0
self.guidance_scale = guidance_scale
+ self.do_classifier_free_guidance = (guidance_scale > 1.0)
+ self.vae_scaling_factor = vae_scaling_factor
self.max_batch_size = max_batch_size
- # Limit the workspace size for systems with GPU memory larger
- # than 6 GiB to silence OOM warnings from TensorRT optimizer.
- _, free_mem, _ = cudart.cudaMemGetInfo()
- GiB = 2 ** 30
- if free_mem > 6*GiB:
- activation_carveout = 4*GiB
- self.max_workspace_size = free_mem - activation_carveout
- else:
- self.max_workspace_size = 0
-
+ self.framework_model_dir = framework_model_dir
self.output_dir = output_dir
+ for directory in [self.framework_model_dir, self.output_dir]:
+ if not os.path.exists(directory):
+ print(f"[I] Create directory: {directory}")
+ pathlib.Path(directory).mkdir(parents=True)
+
self.hf_token = hf_token
self.device = device
self.verbose = verbose
self.nvtx_profile = nvtx_profile
self.version = version
-
- # Schedule options
- sched_opts = {'num_train_timesteps': 1000, 'beta_start': 0.00085, 'beta_end': 0.012}
- if self.version in ("2.0", "2.1"):
- sched_opts['prediction_type'] = 'v_prediction'
+ self.controlnets = controlnets
+
+ # Pipeline type
+ self.pipeline_type = pipeline_type
+ if self.pipeline_type.is_txt2img() or self.pipeline_type.is_controlnet():
+ self.stages = ['clip','unet','vae']
+ elif self.pipeline_type.is_img2img() or self.pipeline_type.is_inpaint():
+ self.stages = ['vae_encoder', 'clip','unet','vae']
+ elif self.pipeline_type.is_sd_xl_base():
+ self.stages = ['clip', 'clip2', 'unetxl']
+ if not return_latents:
+ self.stages.append('vae')
+ elif self.pipeline_type.is_sd_xl_refiner():
+ self.stages = ['clip2', 'unetxl', 'vae']
else:
- sched_opts['prediction_type'] = 'epsilon'
+ raise ValueError(f"Unsupported pipeline {self.pipeline_type.name}.")
+ self.return_latents = return_latents
+
+ # Schedulers
+ map_version_scheduler = {
+ '1.4': 'PNDM',
+ '1.5': 'PNDM',
+ 'dreamshaper-7': 'PNDM',
+ '2.0-base': 'DDIM',
+ '2.0': 'DDIM',
+ '2.1-base': 'PNDM',
+ '2.1': 'DDIM',
+ 'xl-1.0' : 'Euler',
+ 'xl-turbo': 'EulerA'
+ }
+
+ if not scheduler:
+ scheduler = 'UniPC' if self.pipeline_type.is_controlnet() else map_version_scheduler.get(version, 'DDIM')
+ print(f"[I] Autoselected scheduler: {scheduler}")
+
+ def makeScheduler(cls, subfolder="scheduler", **kwargs):
+ return cls.from_pretrained(get_path(self.version, self.pipeline_type), subfolder=subfolder)
if scheduler == "DDIM":
- self.scheduler = DDIMScheduler(device=self.device, **sched_opts)
- elif scheduler == "DPM":
- self.scheduler = DPMScheduler(device=self.device, **sched_opts)
+ self.scheduler = makeScheduler(DDIMScheduler)
+ elif scheduler == "DDPM":
+ self.scheduler = makeScheduler(DDPMScheduler)
elif scheduler == "EulerA":
- self.scheduler = EulerAncestralDiscreteScheduler(device=self.device, **sched_opts)
+ self.scheduler = makeScheduler(EulerAncestralDiscreteScheduler)
+ elif scheduler == "Euler":
+ self.scheduler = makeScheduler(EulerDiscreteScheduler)
+ elif scheduler == "LCM":
+ self.scheduler = makeScheduler(LCMScheduler)
elif scheduler == "LMSD":
- self.scheduler = LMSDiscreteScheduler(device=self.device, **sched_opts)
+ self.scheduler = makeScheduler(LMSDiscreteScheduler)
elif scheduler == "PNDM":
- sched_opts["steps_offset"] = 1
- self.scheduler = PNDMScheduler(device=self.device, **sched_opts)
+ self.scheduler = makeScheduler(PNDMScheduler)
+ elif scheduler == "UniPC":
+ self.scheduler = makeScheduler(UniPCMultistepScheduler)
else:
- raise ValueError(f"Scheduler should be either DDIM, DPM, EulerA, LMSD or PNDM")
+ raise ValueError(f"Unsupported scheduler {scheduler}. Should be either DDIM, DDPM, EulerA, Euler, LCM, LMSD, PNDM, or UniPC.")
- self.stages = stages
- self.inpaint = inpaint
+ self.config = {}
+ if self.pipeline_type.is_sd_xl():
+ self.config['clip_hidden_states'] = True
+ self.torch_inference = torch_inference
self.use_cuda_graph = use_cuda_graph
- # initialized in loadResources()
- self.stream = None
- self.tokenizer = None
# initialized in loadEngines()
self.models = {}
+ self.torch_models = {}
self.engine = {}
self.shared_device_memory = None
+ # initialize lora loader and scales
+ self.lora_loader = None
+ self.lora_scales = dict()
+ if lora_path:
+ self.lora_loader = LoraLoader(lora_path)
+ assert len(lora_path) == len(lora_scale)
+ for i, path in enumerate(lora_path):
+ self.lora_scales[path] = lora_scale[i]
+
+ # initialized in loadResources()
+ self.events = {}
+ self.generator = None
+ self.markers = {}
+ self.seed = None
+ self.stream = None
+ self.tokenizer = None
+
def loadResources(self, image_height, image_width, batch_size, seed):
# Initialize noise generator
- self.generator = torch.Generator(device="cuda").manual_seed(seed) if seed else None
-
- # Pre-compute latent input scales and linear multistep coefficients
- self.scheduler.set_timesteps(self.denoising_steps)
- self.scheduler.configure()
+ if seed:
+ self.seed = seed
+ self.generator = torch.Generator(device="cuda").manual_seed(seed)
# Create CUDA events and stream
- self.events = {}
for stage in ['clip', 'denoise', 'vae', 'vae_encoder']:
- for marker in ['start', 'stop']:
- self.events[stage+'-'+marker] = cudart.cudaEventCreate()[1]
- self.stream = cuda.Stream()
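+ # Each stage keeps a [start, stop] pair of CUDA events used for latency reporting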
+ self.events[stage] = [cudart.cudaEventCreate()[1], cudart.cudaEventCreate()[1]]
+ self.stream = cudart.cudaStreamCreate()[1]
- # Allocate buffers for TensorRT engine bindings
- for model_name, obj in self.models.items():
- self.engine[model_name].allocate_buffers(shape_dict=obj.get_shape_dict(batch_size, image_height, image_width), device=self.device)
+ # Allocate TensorRT I/O buffers
+ if not self.torch_inference:
+ for model_name, obj in self.models.items():
+ self.engine[model_name].allocate_buffers(shape_dict=obj.get_shape_dict(batch_size, image_height, image_width), device=self.device)
def teardown(self):
for e in self.events.values():
- cudart.cudaEventDestroy(e)
+ cudart.cudaEventDestroy(e[0])
+ cudart.cudaEventDestroy(e[1])
for engine in self.engine.values():
del engine
if self.shared_device_memory:
- self.shared_device_memory.free()
+ cudart.cudaFree(self.shared_device_memory)
- self.stream.free()
+ cudart.cudaStreamDestroy(self.stream)
del self.stream
def cachedModelName(self, model_name):
- if self.inpaint:
+ if self.pipeline_type.is_inpaint():
model_name += '_inpaint'
return model_name
- def getOnnxPath(self, model_name, onnx_dir, opt=True):
- return os.path.join(onnx_dir, self.cachedModelName(model_name)+('.opt' if opt else '')+'.onnx')
+ def getOnnxPath(self, model_name, onnx_dir, opt=True, suffix=''):
+ onnx_model_dir = os.path.join(onnx_dir, self.cachedModelName(model_name)+suffix+('.opt' if opt else ''))
+ os.makedirs(onnx_model_dir, exist_ok=True)
+ return os.path.join(onnx_model_dir, 'model.onnx')
+
+ def getEnginePath(self, model_name, engine_dir, enable_refit=False, suffix=''):
+ return os.path.join(engine_dir, self.cachedModelName(model_name)+suffix+('.refit' if enable_refit else '')+'.trt'+trt.__version__+'.plan')
+
+ def getWeightsMapPath(self, model_name, onnx_dir):
+ onnx_model_dir = os.path.join(onnx_dir, self.cachedModelName(model_name)+'.opt')
+ os.makedirs(onnx_model_dir, exist_ok=True)
+ return os.path.join(onnx_model_dir, 'weights_map.json')
- def getEnginePath(self, model_name, engine_dir):
- return os.path.join(engine_dir, self.cachedModelName(model_name)+'.plan')
+ def getRefitNodesPath(self, model_name, onnx_dir, suffix=''):
+ onnx_model_dir = os.path.join(onnx_dir, self.cachedModelName(model_name)+'.opt')
+ os.makedirs(onnx_model_dir, exist_ok=True)
+ return os.path.join(onnx_model_dir, 'refit'+suffix+'.json')
+
+ def getStateDictPath(self, model_name, onnx_dir, suffix=''):
+ onnx_model_dir = os.path.join(onnx_dir, self.cachedModelName(model_name)+suffix)
+ os.makedirs(onnx_model_dir, exist_ok=True)
+ return os.path.join(onnx_model_dir, 'state_dict.pt')
def loadEngines(
self,
engine_dir,
+ framework_model_dir,
onnx_dir,
onnx_opset,
opt_batch_size,
opt_image_height,
opt_image_width,
- force_export=False,
- force_optimize=False,
- force_build=False,
static_batch=False,
static_shape=True,
enable_refit=False,
- enable_preview=False,
enable_all_tactics=False,
timing_cache=None,
- onnx_refit_dir=None,
+ int8=False,
+ quantization_level=2.5,
+ quantization_percentile=0.4,
+ quantization_alpha=0.6,
+ calibration_steps=384,
+ denoising_steps=50,
):
"""
Build and load engines for TensorRT accelerated inference.
@@ -206,9 +326,11 @@ def loadEngines(
Args:
engine_dir (str):
- Directory to write the TensorRT engines.
+ Directory to store the TensorRT engines.
+ framework_model_dir (str):
+ Directory to store the framework model ckpt.
onnx_dir (str):
- Directory to write the ONNX models.
+ Directory to store the ONNX models.
onnx_opset (int):
ONNX opset version to export the models.
opt_batch_size (int):
@@ -217,113 +339,229 @@ def loadEngines(
Image height to optimize for during engine building. Must be a multiple of 8.
opt_image_width (int):
Image width to optimize for during engine building. Must be a multiple of 8.
- force_export (bool):
- Force re-exporting the ONNX models.
- force_optimize (bool):
- Force re-optimizing the ONNX models.
- force_build (bool):
- Force re-building the TensorRT engine.
static_batch (bool):
Build engine only for specified opt_batch_size.
static_shape (bool):
Build engine only for specified opt_image_height & opt_image_width. Default = True.
enable_refit (bool):
Build engines with refit option enabled.
- enable_preview (bool):
- Enable TensorRT preview features.
enable_all_tactics (bool):
Enable all tactic sources during TensorRT engine builds.
timing_cache (str):
- Path to the timing cache to accelerate build or None
- onnx_refit_dir (str):
- Directory containing refit ONNX models.
+ Path to the timing cache to speed up TensorRT build.
"""
- # Load text tokenizer
- self.tokenizer = make_tokenizer(self.version, self.hf_token)
+ # Create directories if missing
+ for directory in [engine_dir, onnx_dir]:
+ if not os.path.exists(directory):
+ print(f"[I] Create directory: {directory}")
+ pathlib.Path(directory).mkdir(parents=True)
+
+ # Load text tokenizer(s)
+ if not self.pipeline_type.is_sd_xl_refiner():
+ self.tokenizer = make_tokenizer(self.version, self.pipeline_type, self.hf_token, framework_model_dir)
+ if self.pipeline_type.is_sd_xl():
+ self.tokenizer2 = make_tokenizer(self.version, self.pipeline_type, self.hf_token, framework_model_dir, subfolder='tokenizer_2')
# Load pipeline models
- models_args = {'version': self.version, 'hf_token': self.hf_token, 'device': self.device, \
- 'verbose': self.verbose, 'max_batch_size': self.max_batch_size}
- if 'vae_encoder' in self.stages:
- self.models['vae_encoder'] = make_VAEEncoder(inpaint=self.inpaint, **models_args)
+ models_args = {'version': self.version, 'pipeline': self.pipeline_type, 'device': self.device,
+ 'hf_token': self.hf_token, 'verbose': self.verbose, 'framework_model_dir': framework_model_dir,
+ 'max_batch_size': self.max_batch_size}
+
if 'clip' in self.stages:
- self.models['clip'] = make_CLIP(inpaint=self.inpaint, **models_args)
+ subfolder = 'text_encoder'
+ self.models['clip'] = CLIPModel(**models_args, fp16=True, embedding_dim=get_clip_embedding_dim(self.version, self.pipeline_type), output_hidden_states=self.config.get('clip_hidden_states', False), subfolder=subfolder)
+
+ if 'clip2' in self.stages:
+ subfolder = 'text_encoder_2'
+ self.models['clip2'] = CLIPWithProjModel(**models_args, fp16=True, output_hidden_states=self.config.get('clip_hidden_states', False), subfolder=subfolder)
+
+ lora_dict, lora_alphas = (None, None)
if 'unet' in self.stages:
- self.models['unet'] = make_UNet(inpaint=self.inpaint, **models_args)
+ if self.lora_loader:
+ lora_dict, lora_alphas = self.lora_loader.get_dicts('unet')
+ assert len(lora_dict) == len(self.lora_scales)
+ self.models['unet'] = UNetModel(**models_args, fp16=True, controlnets=self.controlnets,
+ lora_scales=self.lora_scales, lora_dict=lora_dict, lora_alphas=lora_alphas, do_classifier_free_guidance=self.do_classifier_free_guidance)
+
+ if 'unetxl' in self.stages:
+ if not self.pipeline_type.is_sd_xl_refiner() and self.lora_loader:
+ lora_dict, lora_alphas = self.lora_loader.get_dicts('unet')
+ assert len(lora_dict) == len(self.lora_scales)
+ self.models['unetxl'] = UNetXLModel(**models_args, fp16=True,
+ lora_scales=self.lora_scales, lora_dict=lora_dict, lora_alphas=lora_alphas, do_classifier_free_guidance=self.do_classifier_free_guidance)
+
+ vae_fp16 = not self.pipeline_type.is_sd_xl()
+
if 'vae' in self.stages:
- self.models['vae'] = make_VAE(inpaint=self.inpaint, **models_args)
+ self.models['vae'] = VAEModel(**models_args, fp16=vae_fp16)
+
+ if 'vae_encoder' in self.stages:
+ self.models['vae_encoder'] = VAEEncoderModel(**models_args, fp16=vae_fp16)
+
+ # Configure pipeline models to load
+ model_names = self.models.keys()
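+ # Suffix encoding the LoRA configuration (md5 of each path plus its scale) so cached ONNX models and engines are keyed per LoRA setup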
+ lora_suffix = '-'+'-'.join([str(md5(path.encode('utf-8')).hexdigest())+'-'+('%.2f' % self.lora_scales[path]) for path in sorted(self.lora_loader.paths)]) if self.lora_loader else ''
+ # Enable refit and LoRA merging only for UNet & UNetXL for now
+ do_engine_refit = dict(zip(model_names, [not self.pipeline_type.is_sd_xl_refiner() and enable_refit and model_name.startswith('unet') for model_name in model_names]))
+ do_lora_merge = dict(zip(model_names, [not enable_refit and self.lora_loader and model_name.startswith('unet') for model_name in model_names]))
+ # Fall back to PyTorch inference for all models when torch_inference is specified
+ torch_fallback = dict(zip(model_names, [self.torch_inference for model_name in model_names]))
+ model_suffix = dict(zip(model_names, [lora_suffix if do_lora_merge[model_name] else '' for model_name in model_names]))
+ use_int8 = dict.fromkeys(model_names, False)
+ if int8:
+ assert self.pipeline_type.is_sd_xl(), "int8 quantization only supported for SDXL pipeline"
+ use_int8['unetxl'] = True
+ model_suffix['unetxl'] += f"-int8.l{quantization_level}.bs2.s{denoising_steps}.c{calibration_steps}.p{quantization_percentile}.a{quantization_alpha}"
+ onnx_path = dict(zip(model_names, [self.getOnnxPath(model_name, onnx_dir, opt=False, suffix=model_suffix[model_name]) for model_name in model_names]))
+ onnx_opt_path = dict(zip(model_names, [self.getOnnxPath(model_name, onnx_dir, suffix=model_suffix[model_name]) for model_name in model_names]))
+ engine_path = dict(zip(model_names, [self.getEnginePath(model_name, engine_dir, do_engine_refit[model_name], suffix=model_suffix[model_name]) for model_name in model_names]))
+ weights_map_path = dict(zip(model_names, [(self.getWeightsMapPath(model_name, onnx_dir) if do_engine_refit[model_name] else None) for model_name in model_names]))
- # Export models to ONNX
for model_name, obj in self.models.items():
- engine_path = self.getEnginePath(model_name, engine_dir)
- if force_export or force_build or not os.path.exists(engine_path):
- onnx_path = self.getOnnxPath(model_name, onnx_dir, opt=False)
- onnx_opt_path = self.getOnnxPath(model_name, onnx_dir)
- if force_export or not os.path.exists(onnx_opt_path):
- if force_export or not os.path.exists(onnx_path):
- print(f"Exporting model: {onnx_path}")
- model = obj.get_model()
- with torch.inference_mode(), torch.autocast("cuda"):
- inputs = obj.get_sample_input(opt_batch_size, opt_image_height, opt_image_width)
- torch.onnx.export(model,
- inputs,
- onnx_path,
- export_params=True,
- opset_version=onnx_opset,
- do_constant_folding=True,
- input_names=obj.get_input_names(),
- output_names=obj.get_output_names(),
- dynamic_axes=obj.get_dynamic_axes(),
+ if torch_fallback[model_name]:
+ continue
+ # Export models to ONNX and save weights name mapping
+ do_export_onnx = not os.path.exists(engine_path[model_name]) and not os.path.exists(onnx_opt_path[model_name])
+ do_export_weights_map = weights_map_path[model_name] and not os.path.exists(weights_map_path[model_name])
+ if do_export_onnx or do_export_weights_map:
+ # Non-quantized ONNX export
+ if not use_int8[model_name]:
+ obj.export_onnx(onnx_path[model_name], onnx_opt_path[model_name], onnx_opset, opt_image_height, opt_image_width, enable_lora_merge=do_lora_merge[model_name], static_shape=static_shape)
+ else:
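+ # INT8 path: calibrate the UNet with AMMO, cache the calibrated state dict, then export a quantized ONNX model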
+ state_dict_path = self.getStateDictPath(model_name, onnx_dir, suffix=model_suffix[model_name])
+ if not os.path.exists(state_dict_path):
+ print(f"[I] Calibrated weights not found, generating {state_dict_path}")
+ pipeline = obj.get_pipeline()
+ model = pipeline.unet
+ replace_lora_layers(model)
+ calibration_file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'calibration-prompts.txt')
+ # Use batch_size = 2 for UNet calibration
+ calibration_prompts = load_calib_prompts(2, calibration_file)
+ # TODO check size > calibration_steps
+ quant_config = get_smoothquant_config(model, quantization_level)
+ if quantization_percentile is not None:
+ quant_config["percentile"] = quantization_percentile
+ quant_config["base-step"] = int(denoising_steps)
+
+ atq.replace_quant_module(model)
+ atq.set_quantizer_by_cfg(model, quant_config["quant_cfg"])
+ if quantization_percentile is not None:
+ calibration.precentile_calib_mode(base_unet=model, quant_config=quant_config)
+ if quantization_alpha is not None:
+ calibration.reg_alpha_qkv(base_unet=model, alpha=quantization_alpha)
+
+ def do_calibrate(base, calibration_prompts, **kwargs):
+ for i_th, prompts in enumerate(calibration_prompts):
+ if i_th >= kwargs["calib_size"]:
+ return
+ base(
+ prompt=prompts,
+ num_inference_steps=kwargs["n_steps"],
+ negative_prompt=[
+ "normal quality, low quality, worst quality, low res, blurry, nsfw, nude"
+ ]
+ * len(prompts),
+ ).images
+
+ def calibration_loop():
+ do_calibrate(
+ base=pipeline,
+ calibration_prompts=calibration_prompts,
+ calib_size=calibration_steps,
+ n_steps=denoising_steps,
)
- del model
- torch.cuda.empty_cache()
- gc.collect()
- else:
- print(f"Found cached model: {onnx_path}")
- # Optimize onnx
- if force_optimize or not os.path.exists(onnx_opt_path):
- print(f"Generating optimizing model: {onnx_opt_path}")
- onnx_opt_graph = obj.optimize(onnx.load(onnx_path))
- onnx.save(onnx_opt_graph, onnx_opt_path)
+ print(f"[I] Performing int8 calibration for {calibration_steps} steps. This can take a long time.")
+ calibration.calibrate(model, quant_config["algorithm"], forward_loop=calibration_loop)
+ torch.save(model.state_dict(), state_dict_path)
+
+ print(f"[I] Generaing quantized ONNX model: {onnx_opt_path[model_name]}")
+ if not os.path.exists(onnx_path[model_name]):
+ model = obj.get_model()
+ replace_lora_layers(model)
+ atq.replace_quant_module(model)
+ quant_config = atq.INT8_DEFAULT_CFG
+ atq.set_quantizer_by_cfg(model, quant_config["quant_cfg"])
+ model.load_state_dict(torch.load(state_dict_path), strict=True)
+ quantize_lvl(model, quantization_level)
+ atq.disable_quantizer(model, filter_func)
+ model.to(torch.float32) # QDQ needs to be in FP32
else:
- print(f"Found cached optimized model: {onnx_opt_path} ")
+ model = None
+ obj.export_onnx(onnx_path[model_name], onnx_opt_path[model_name], onnx_opset, opt_image_height, opt_image_width, custom_model=model)
+
+ # FIXME do_export_weights_map needs ONNX graph
+ if do_export_weights_map:
+ print(f"[I] Saving weights map: {weights_map_path[model_name]}")
+ obj.export_weights_map(onnx_opt_path[model_name], weights_map_path[model_name])
# Build TensorRT engines
for model_name, obj in self.models.items():
- engine_path = self.getEnginePath(model_name, engine_dir)
- engine = Engine(engine_path)
- onnx_path = self.getOnnxPath(model_name, onnx_dir, opt=False)
- onnx_opt_path = self.getOnnxPath(model_name, onnx_dir)
-
- if force_build or not os.path.exists(engine.engine_path):
- engine.build(onnx_opt_path,
- fp16=True,
+ if torch_fallback[model_name]:
+ continue
+ engine = Engine(engine_path[model_name])
+ if not os.path.exists(engine_path[model_name]):
+ update_output_names = obj.get_output_names() + obj.extra_output_names if obj.extra_output_names else None
+ extra_build_args = {'verbose': self.verbose}
+ if use_int8[model_name]:
+ extra_build_args['int8'] = True
+ extra_build_args['precision_constraints'] = 'prefer'
+ extra_build_args['builder_optimization_level'] = 4
+ fp16amp = obj.fp16
+ engine.build(onnx_opt_path[model_name],
+ fp16=fp16amp,
input_profile=obj.get_input_profile(
opt_batch_size, opt_image_height, opt_image_width,
static_batch=static_batch, static_shape=static_shape
),
- enable_refit=enable_refit,
- enable_preview=enable_preview,
+ enable_refit=do_engine_refit[model_name],
enable_all_tactics=enable_all_tactics,
timing_cache=timing_cache,
- workspace_size=self.max_workspace_size)
+ update_output_names=update_output_names,
+ **extra_build_args)
self.engine[model_name] = engine
- # Load and activate TensorRT engines
- max_device_memory = 0
+ # Load TensorRT engines
for model_name, obj in self.models.items():
- engine = self.engine[model_name]
- engine.load()
+ if torch_fallback[model_name]:
+ continue
+ self.engine[model_name].load()
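+ # Refit the engine with LoRA-merged weights when refit is enabled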
+ if do_engine_refit[model_name] and obj.lora_dict:
+ assert weights_map_path[model_name]
+ with open(weights_map_path[model_name], 'r') as fp_wts:
+ print(f"[I] Loading weights map: {weights_map_path[model_name]} ")
+ [weights_name_mapping, weights_shape_mapping] = json.load(fp_wts)
+ refit_weights_path = self.getRefitNodesPath(model_name, engine_dir, suffix=lora_suffix)
+ if not os.path.exists(refit_weights_path):
+ print(f"[I] Saving refit weights: {refit_weights_path}")
+ model = merge_loras(obj.get_model(), obj.lora_dict, obj.lora_alphas, obj.lora_scales)
+ refit_weights = get_refit_weights(model.state_dict(), onnx_opt_path[model_name], weights_name_mapping, weights_shape_mapping)
+ torch.save(refit_weights, refit_weights_path)
+ unload_model(model)
+ else:
+ print(f"[I] Loading refit weights: {refit_weights_path}")
+ refit_weights = torch.load(refit_weights_path)
+ self.engine[model_name].refit(refit_weights, obj.fp16)
+
+ # Load torch models
+ for model_name, obj in self.models.items():
+ if torch_fallback[model_name]:
+ self.torch_models[model_name] = obj.get_model(torch_inference=self.torch_inference)
+
+ def calculateMaxDeviceMemory(self):
+ max_device_memory = 0
+ for model_name, engine in self.engine.items():
max_device_memory = max(max_device_memory, engine.engine.device_memory_size)
- if onnx_refit_dir:
- onnx_refit_path = self.getOnnxPath(model_name, onnx_refit_dir)
- if os.path.exists(onnx_refit_path):
- engine.refit(onnx_opt_path, onnx_refit_path)
+ return max_device_memory
- self.shared_device_memory = cuda.DeviceArray.raw((max_device_memory,))
+ def activateEngines(self, shared_device_memory=None):
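+ # All engines share a single device-memory allocation sized for the largest engine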
+ if shared_device_memory is None:
+ max_device_memory = self.calculateMaxDeviceMemory()
+ _, shared_device_memory = cudart.cudaMalloc(max_device_memory)
+ self.shared_device_memory = shared_device_memory
+ # Load and activate TensorRT engines
for engine in self.engine.values():
- engine.activate(reuse_device_memory=self.shared_device_memory.ptr)
+ engine.activate(reuse_device_memory=self.shared_device_memory)
def runEngine(self, model_name, feed_dict):
engine = self.engine[model_name]
@@ -337,147 +575,436 @@ def initialize_latents(self, batch_size, unet_channels, latent_height, latent_wi
latents = latents * self.scheduler.init_noise_sigma
return latents
- def initialize_timesteps(self, timesteps, strength):
- self.scheduler.set_timesteps(timesteps)
- offset = self.scheduler.steps_offset if hasattr(self.scheduler, "steps_offset") else 0
- init_timestep = int(timesteps * strength) + offset
- init_timestep = min(init_timestep, timesteps)
- t_start = max(timesteps - init_timestep + offset, 0)
- timesteps = self.scheduler.timesteps[t_start:].to(self.device)
- return timesteps, t_start
+ def profile_start(self, name, color='blue'):
+ if self.nvtx_profile:
+ self.markers[name] = nvtx.start_range(message=name, color=color)
+ if name in self.events:
+ cudart.cudaEventRecord(self.events[name][0], 0)
- def preprocess_images(self, batch_size, images=()):
+ def profile_stop(self, name):
+ if name in self.events:
+ cudart.cudaEventRecord(self.events[name][1], 0)
if self.nvtx_profile:
- nvtx_image_preprocess = nvtx.start_range(message='image_preprocess', color='pink')
- init_images=[]
+ nvtx.end_range(self.markers[name])
+
+ def preprocess_images(self, batch_size, images=()):
+ if not images:
+ return ()
+ self.profile_start('preprocess', color='pink')
+ input_images=[]
for image in images:
image = image.to(self.device).float()
- image = image.repeat(batch_size, 1, 1, 1)
- init_images .append(image)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_image_preprocess)
- return tuple(init_images)
+ if image.shape[0] != batch_size:
+ image = image.repeat(batch_size, 1, 1, 1)
+ input_images.append(image)
+ self.profile_stop('preprocess')
+ return tuple(input_images)
+
+ def preprocess_controlnet_images(self, batch_size, images=None):
+ '''
+ images: List of PIL.Image.Image
+ '''
+ if images is None:
+ return None
+ self.profile_start('preprocess', color='pink')
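+ # Convert PIL images to NCHW float tensors in [0, 1] and repeat along the batch dimension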
+ images = [(np.array(i.convert("RGB")).astype(np.float32) / 255.0)[..., None].transpose(3, 2, 0, 1).repeat(batch_size, axis=0) for i in images]
+ # Duplicate each image for classifier-free guidance (unconditional + conditional batch)
+ images = [torch.cat([torch.from_numpy(i).to(self.device).float()] * 2) for i in images]
+ images = torch.cat([image[None, ...] for image in images], dim=0)
+ self.profile_stop('preprocess')
+ return images
- def encode_prompt(self, prompt, negative_prompt):
- if self.nvtx_profile:
- nvtx_clip = nvtx.start_range(message='clip', color='green')
- cudart.cudaEventRecord(self.events['clip-start'], 0)
+ def encode_prompt(self, prompt, negative_prompt, encoder='clip', pooled_outputs=False, output_hidden_states=False):
+ self.profile_start('clip', color='green')
+
+ tokenizer = self.tokenizer2 if encoder == 'clip2' else self.tokenizer
+
+ def tokenize(prompt, output_hidden_states):
+ text_input_ids = tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ ).input_ids.type(torch.int32).to(self.device)
+
+ text_hidden_states = None
+ if self.torch_inference:
+ outputs = self.torch_models[encoder](text_input_ids, output_hidden_states=output_hidden_states)
+ text_embeddings = outputs[0].clone()
+ if output_hidden_states:
+ text_hidden_states = outputs['hidden_states'][-2].clone()
+ else:
+ # NOTE: output tensor for CLIP must be cloned because it will be overwritten when called again for negative prompt
+ outputs = self.runEngine(encoder, {'input_ids': text_input_ids})
+ text_embeddings = outputs['text_embeddings'].clone()
+ if output_hidden_states:
+ text_hidden_states = outputs['hidden_states'].clone()
+ return text_embeddings, text_hidden_states
# Tokenize prompt
- text_input_ids = self.tokenizer(
- prompt,
- padding="max_length",
- max_length=self.tokenizer.model_max_length,
- truncation=True,
- return_tensors="pt",
- ).input_ids.type(torch.int32).to(self.device)
-
- text_input_ids_inp = text_input_ids
- # NOTE: output tensor for CLIP must be cloned because it will be overwritten when called again for negative prompt
- text_embeddings = self.runEngine('clip', {"input_ids": text_input_ids_inp})['text_embeddings'].clone()
-
- # Tokenize negative prompt
- uncond_input_ids = self.tokenizer(
- negative_prompt,
- padding="max_length",
- max_length=self.tokenizer.model_max_length,
- truncation=True,
- return_tensors="pt",
- ).input_ids.type(torch.int32).to(self.device)
- uncond_input_ids_inp = uncond_input_ids
- uncond_embeddings = self.runEngine('clip', {"input_ids": uncond_input_ids_inp})['text_embeddings']
-
- # Concatenate the unconditional and text embeddings into a single batch to avoid doing two forward passes for classifier free guidance
- text_embeddings = torch.cat([uncond_embeddings, text_embeddings]).to(dtype=torch.float16)
-
- cudart.cudaEventRecord(self.events['clip-stop'], 0)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_clip)
+ text_embeddings, text_hidden_states = tokenize(prompt, output_hidden_states)
- return text_embeddings
+ if self.do_classifier_free_guidance:
+ # Tokenize negative prompt
+ uncond_embeddings, uncond_hidden_states = tokenize(negative_prompt, output_hidden_states)
- def denoise_latent(self, latents, text_embeddings, timesteps=None, step_offset=0, mask=None, masked_image_latents=None):
- cudart.cudaEventRecord(self.events['denoise-start'], 0)
- if not isinstance(timesteps, torch.Tensor):
- timesteps = self.scheduler.timesteps
- for step_index, timestep in enumerate(timesteps):
- if self.nvtx_profile:
- nvtx_latent_scale = nvtx.start_range(message='latent_scale', color='pink')
+ # Concatenate the unconditional and text embeddings into a single batch to avoid doing two forward passes for classifier free guidance
+ text_embeddings = torch.cat([uncond_embeddings, text_embeddings]).to(dtype=torch.float16)
- # Expand the latents if we are doing classifier free guidance
- latent_model_input = torch.cat([latents] * 2)
- latent_model_input = self.scheduler.scale_model_input(latent_model_input, step_offset + step_index, timestep)
- if isinstance(mask, torch.Tensor):
- latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_latent_scale)
-
- # Predict the noise residual
- if self.nvtx_profile:
- nvtx_unet = nvtx.start_range(message='unet', color='blue')
-
- embeddings_dtype = np.float16
- timestep_float = timestep.float() if timestep.dtype != torch.float32 else timestep
-
- sample_inp = latent_model_input
- timestep_inp = timestep_float
- embeddings_inp = text_embeddings
- noise_pred = self.runEngine('unet', {"sample": sample_inp, "timestep": timestep_inp, "encoder_hidden_states": embeddings_inp})['latent']
- if self.nvtx_profile:
- nvtx.end_range(nvtx_unet)
-
- if self.nvtx_profile:
- nvtx_latent_step = nvtx.start_range(message='latent_step', color='pink')
+ if pooled_outputs:
+ pooled_output = text_embeddings
- # Perform guidance
- noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
- noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
+ if output_hidden_states:
+ text_embeddings = torch.cat([uncond_hidden_states, text_hidden_states]).to(dtype=torch.float16) if self.do_classifier_free_guidance else text_hidden_states
- latents = self.scheduler.step(noise_pred, latents, step_offset + step_index, timestep)
-
- if self.nvtx_profile:
- nvtx.end_range(nvtx_latent_step)
+ self.profile_stop('clip')
+ if pooled_outputs:
+ return text_embeddings, pooled_output
+ return text_embeddings
- latents = 1. / 0.18215 * latents
- cudart.cudaEventRecord(self.events['denoise-stop'], 0)
+ # from diffusers (get_timesteps)
+ def get_timesteps(self, num_inference_steps, strength, denoising_start=None):
+ # get the original timestep using init_timestep
+ if denoising_start is None:
+ init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
+ t_start = max(num_inference_steps - init_timestep, 0)
+ else:
+ t_start = 0
+
+ timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
+
+ # Strength is irrelevant if we directly request a timestep to start at;
+ # that is, strength is determined by the denoising_start instead.
+ if denoising_start is not None:
+ discrete_timestep_cutoff = int(
+ round(
+ self.scheduler.config.num_train_timesteps
+ - (denoising_start * self.scheduler.config.num_train_timesteps)
+ )
+ )
+
+ num_inference_steps = (timesteps < discrete_timestep_cutoff).sum().item()
+ if self.scheduler.order == 2 and num_inference_steps % 2 == 0:
+ # if the scheduler is a 2nd order scheduler we might have to do +1
+ # because `num_inference_steps` might be even given that every timestep
+ # (except the highest one) is duplicated. If `num_inference_steps` is even it would
+ # mean that we cut the timesteps in the middle of the denoising step
+ # (between 1st and 2nd devirative) which leads to incorrect results. By adding 1
+ # we ensure that the denoising process always ends after the 2nd derivate step of the scheduler
+ num_inference_steps = num_inference_steps + 1
+
+ # because t_n+1 >= t_n, we slice the timesteps starting from the end
+ timesteps = timesteps[-num_inference_steps:]
+ return timesteps, num_inference_steps
+
+ return timesteps, num_inference_steps - t_start
+
+ def denoise_latent(self,
+ latents,
+ text_embeddings,
+ denoiser='unet',
+ timesteps=None,
+ step_offset=0,
+ mask=None,
+ masked_image_latents=None,
+ image_guidance=1.5,
+ controlnet_imgs=None,
+ controlnet_scales=None,
+ text_embeds=None,
+ time_ids=None):
+
+ assert image_guidance > 1.0, "Image guidance has to be > 1.0"
+
+ controlnet_imgs = self.preprocess_controlnet_images(latents.shape[0], controlnet_imgs)
+
+ do_autocast = self.torch_inference != '' and self.models[denoiser].fp16
+ with torch.autocast('cuda', enabled=do_autocast):
+ self.profile_start('denoise', color='blue')
+ for step_index, timestep in enumerate(timesteps):
+ # Expand the latents if we are doing classifier free guidance
+ latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
+ latent_model_input = self.scheduler.scale_model_input(latent_model_input, timestep)
+ if isinstance(mask, torch.Tensor):
+ latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
+
+ # Predict the noise residual
+ if self.torch_inference:
+ params = {"sample": latent_model_input, "timestep": timestep, "encoder_hidden_states": text_embeddings}
+ if controlnet_imgs is not None:
+ params.update({"images": controlnet_imgs, "controlnet_scales": controlnet_scales})
+ added_cond_kwargs = {}
+ if text_embeds is not None:
+ added_cond_kwargs.update({'text_embeds': text_embeds})
+ if time_ids is not None:
+ added_cond_kwargs.update({'time_ids': time_ids})
+ if text_embeds is not None or time_ids is not None:
+ params.update({'added_cond_kwargs': added_cond_kwargs})
+ noise_pred = self.torch_models[denoiser](**params)["sample"]
+ else:
+ timestep_float = timestep.float() if timestep.dtype != torch.float32 else timestep
+
+ params = {"sample": latent_model_input, "timestep": timestep_float, "encoder_hidden_states": text_embeddings}
+ if controlnet_imgs is not None:
+ params.update({"images": controlnet_imgs, "controlnet_scales": controlnet_scales})
+ if text_embeds is not None:
+ params.update({'text_embeds': text_embeds})
+ if time_ids is not None:
+ params.update({'time_ids': time_ids})
+ noise_pred = self.runEngine(denoiser, params)['latent']
+
+ # Perform guidance
+ if self.do_classifier_free_guidance:
+ noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
+ noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
+
+ # from diffusers (prepare_extra_step_kwargs)
+ extra_step_kwargs = {}
+ if "eta" in set(inspect.signature(self.scheduler.step).parameters.keys()):
+ # TODO: configurable eta
+ eta = 0.0
+ extra_step_kwargs["eta"] = eta
+ if "generator" in set(inspect.signature(self.scheduler.step).parameters.keys()):
+ extra_step_kwargs["generator"] = self.generator
+
+ latents = self.scheduler.step(noise_pred, timestep, latents, **extra_step_kwargs, return_dict=False)[0]
+
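+ # Rescale latents by the inverse VAE scaling factor before decoding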
+ latents = 1. / self.vae_scaling_factor * latents
+ latents = latents.to(dtype=torch.float32)
+
+ self.profile_stop('denoise')
return latents
- def encode_image(self, init_image):
- if self.nvtx_profile:
- nvtx_vae = nvtx.start_range(message='vae_encoder', color='red')
- cudart.cudaEventRecord(self.events['vae_encoder-start'], 0)
- init_latents = self.runEngine('vae_encoder', {"images": init_image})['latent']
- cudart.cudaEventRecord(self.events['vae_encoder-stop'], 0)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_vae)
-
- init_latents = 0.18215 * init_latents
- return init_latents
+ def encode_image(self, input_image):
+ self.profile_start('vae_encoder', color='red')
+ if self.torch_inference:
+ image_latents = self.torch_models['vae_encoder'](input_image)
+ else:
+ image_latents = self.runEngine('vae_encoder', {'images': input_image})['latent']
+ image_latents = self.vae_scaling_factor * image_latents
+ self.profile_stop('vae_encoder')
+ return image_latents
def decode_latent(self, latents):
- if self.nvtx_profile:
- nvtx_vae = nvtx.start_range(message='vae', color='red')
- cudart.cudaEventRecord(self.events['vae-start'], 0)
- images = self.runEngine('vae', {"latent": latents})['images']
- cudart.cudaEventRecord(self.events['vae-stop'], 0)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_vae)
+ self.profile_start('vae', color='red')
+ if self.torch_inference:
+ images = self.torch_models['vae'](latents)['sample']
+ else:
+ images = self.runEngine('vae', {'latent': latents})['images']
+ self.profile_stop('vae')
return images
- def print_summary(self, denoising_steps, tic, toc, vae_enc=False):
- print('|------------|--------------|')
- print('| {:^10} | {:^12} |'.format('Module', 'Latency'))
- print('|------------|--------------|')
- if vae_enc:
- print('| {:^10} | {:>9.2f} ms |'.format('VAE-Enc', cudart.cudaEventElapsedTime(self.events['vae_encoder-start'], self.events['vae_encoder-stop'])[1]))
- print('| {:^10} | {:>9.2f} ms |'.format('CLIP', cudart.cudaEventElapsedTime(self.events['clip-start'], self.events['clip-stop'])[1]))
- print('| {:^10} | {:>9.2f} ms |'.format('UNet x '+str(denoising_steps), cudart.cudaEventElapsedTime(self.events['denoise-start'], self.events['denoise-stop'])[1]))
- print('| {:^10} | {:>9.2f} ms |'.format('VAE-Dec', cudart.cudaEventElapsedTime(self.events['vae-start'], self.events['vae-stop'])[1]))
- print('|------------|--------------|')
- print('| {:^10} | {:>9.2f} ms |'.format('Pipeline', (toc - tic)*1000.))
- print('|------------|--------------|')
-
- def save_image(self, images, pipeline, prompt):
- # Save image
- image_name_prefix = pipeline+'-fp16'+''.join(set(['-'+prompt[i].replace(' ','_')[:10] for i in range(len(prompt))]))+'-'
- save_image(images, self.output_dir, image_name_prefix)
+ def print_summary(self, denoising_steps, walltime_ms, batch_size):
+ print('|-----------------|--------------|')
+ print('| {:^15} | {:^12} |'.format('Module', 'Latency'))
+ print('|-----------------|--------------|')
+ if 'vae_encoder' in self.stages:
+ print('| {:^15} | {:>9.2f} ms |'.format('VAE-Enc', cudart.cudaEventElapsedTime(self.events['vae_encoder'][0], self.events['vae_encoder'][1])[1]))
+ print('| {:^15} | {:>9.2f} ms |'.format('CLIP', cudart.cudaEventElapsedTime(self.events['clip'][0], self.events['clip'][1])[1]))
+ print('| {:^15} | {:>9.2f} ms |'.format('UNet'+('+CNet' if self.pipeline_type.is_controlnet() else '')+' x '+str(denoising_steps), cudart.cudaEventElapsedTime(self.events['denoise'][0], self.events['denoise'][1])[1]))
+ print('| {:^15} | {:>9.2f} ms |'.format('VAE-Dec', cudart.cudaEventElapsedTime(self.events['vae'][0], self.events['vae'][1])[1]))
+ print('|-----------------|--------------|')
+ print('| {:^15} | {:>9.2f} ms |'.format('Pipeline', walltime_ms))
+ print('|-----------------|--------------|')
+ print('Throughput: {:.2f} image/s'.format(batch_size*1000./walltime_ms))
+
+ def save_image(self, images, pipeline, prompt, seed):
+ # Save image
+ image_name_prefix = pipeline+''.join(set(['-'+prompt[i].replace(' ','_')[:10] for i in range(len(prompt))]))+'-'+str(seed)+'-'
+ save_image(images, self.output_dir, image_name_prefix)
+
+ def infer(
+ self,
+ prompt,
+ negative_prompt,
+ image_height,
+ image_width,
+ input_image=None,
+ image_strength=0.75,
+ mask_image=None,
+ controlnet_scales=None,
+ aesthetic_score=6.0,
+ negative_aesthetic_score=2.5,
+ warmup=False,
+ verbose=False,
+ save_image=True,
+ ):
+ """
+ Run the diffusion pipeline.
+
+ Args:
+ prompt (str):
+ The text prompt to guide image generation.
+ negative_prompt (str):
+ The prompt not to guide the image generation.
+ image_height (int):
+ Height (in pixels) of the image to be generated. Must be a multiple of 8.
+ image_width (int):
+ Width (in pixels) of the image to be generated. Must be a multiple of 8.
+ input_image (image):
+ Input image used to initialize the latents or to be inpainted.
+ image_strength (float):
+ Strength of transformation applied to input_image. Must be between 0 and 1.
+ mask_image (image):
+ Mask image containing the region to be inpainted.
+ controlnet_scales (torch.Tensor):
+ A tensor containing ControlNet scales, essential for multi-ControlNet.
+ Its length must equal the number of ControlNets.
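+ aesthetic_score (float):
+ Aesthetic score conditioning used by the SDXL refiner time embeddings.
+ negative_aesthetic_score (float):
+ Aesthetic score conditioning applied to the negative (unconditional) branch of the SDXL refiner.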
+ warmup (bool):
+ Indicate if this is a warmup run.
+ verbose (bool):
+ Enable verbose logging.
+ save_image (bool):
+ Save the generated image (if applicable)
+ """
+ assert len(prompt) == len(negative_prompt)
+ batch_size = len(prompt)
+
+ # Spatial dimensions of latent tensor
+ latent_height = image_height // 8
+ latent_width = image_width // 8
+
+ if self.generator and self.seed:
+ self.generator.manual_seed(self.seed)
+
+ num_inference_steps = self.denoising_steps
+
+ with torch.inference_mode(), trt.Runtime(TRT_LOGGER):
+ torch.cuda.synchronize()
+ e2e_tic = time.perf_counter()
+
+ # TODO: support custom timesteps
+ timesteps = None
+ if timesteps is not None:
+ if not ("timesteps" in set(inspect.signature(self.scheduler.set_timesteps).parameters.keys())):
+ raise ValueError(
+ f"The current scheduler class {self.scheduler.__class__}'s `set_timesteps` does not support custom"
+ f" timestep schedules. Please check whether you are using the correct scheduler."
+ )
+ self.scheduler.set_timesteps(timesteps=timesteps, device=self.device)
+ assert self.denoising_steps == len(self.scheduler.timesteps)
+ else:
+ self.scheduler.set_timesteps(self.denoising_steps, device=self.device)
+ timesteps = self.scheduler.timesteps.to(self.device)
+
+ denoise_kwargs = {}
+ if not (self.pipeline_type.is_img2img() or self.pipeline_type.is_sd_xl_refiner()):
+ # Initialize latents
+ latents = self.initialize_latents(batch_size=batch_size,
+ unet_channels=4,
+ latent_height=latent_height,
+ latent_width=latent_width)
+ if self.pipeline_type.is_controlnet():
+ denoise_kwargs.update({'controlnet_imgs': input_image, 'controlnet_scales': controlnet_scales})
+
+ # Pre-process and VAE encode input image
+ if self.pipeline_type.is_img2img() or self.pipeline_type.is_inpaint() or self.pipeline_type.is_sd_xl_refiner():
+ assert input_image is not None
+ # Initialize timesteps and pre-process input image
+ timesteps, num_inference_steps = self.get_timesteps(self.denoising_steps, image_strength)
+ denoise_kwargs.update({'timesteps': timesteps})
+ if self.pipeline_type.is_img2img() or self.pipeline_type.is_sd_xl_refiner():
+ latent_timestep = timesteps[:1].repeat(batch_size)
+ input_image = self.preprocess_images(batch_size, (input_image,))[0]
+ # Encode if not a latent
+ image_latents = input_image if input_image.shape[1] == 4 else self.encode_image(input_image)
+ # Add noise to latents using timesteps
+ noise = torch.randn(image_latents.shape, generator=self.generator, device=self.device, dtype=torch.float32)
+ latents = self.scheduler.add_noise(image_latents, noise, latent_timestep)
+ elif self.pipeline_type.is_inpaint():
+ mask, mask_image = self.preprocess_images(batch_size, prepare_mask_and_masked_image(input_image, mask_image))
+ mask = torch.nn.functional.interpolate(mask, size=(latent_height, latent_width))
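+ # Duplicate mask and masked-image latents to match the doubled batch used for classifier-free guidance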
+ mask = torch.cat([mask] * 2)
+ masked_image_latents = self.encode_image(mask_image)
+ masked_image_latents = torch.cat([masked_image_latents] * 2)
+ denoise_kwargs.update({'mask': mask, 'masked_image_latents': masked_image_latents})
+
+ # CLIP text encoder(s)
+ if self.pipeline_type.is_sd_xl():
+ text_embeddings2, pooled_embeddings2 = self.encode_prompt(prompt, negative_prompt,
+ encoder='clip2', pooled_outputs=True, output_hidden_states=True)
+
+ # Merge text embeddings
+ if self.pipeline_type.is_sd_xl_base():
+ text_embeddings = self.encode_prompt(prompt, negative_prompt, output_hidden_states=True)
+ text_embeddings = torch.cat([text_embeddings, text_embeddings2], dim=-1)
+ else:
+ text_embeddings = text_embeddings2
+
+ # Time embeddings
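+ # SDXL micro-conditioning: time_ids pack the original size, crop coordinates, and target size (the refiner uses an aesthetic score instead of the target size)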
+ def _get_add_time_ids(original_size, crops_coords_top_left, target_size, dtype, aesthetic_score=None, negative_aesthetic_score=None):
+ if self.pipeline_type.is_sd_xl_refiner(): #self.requires_aesthetics_score:
+ add_time_ids = list(original_size + crops_coords_top_left + (aesthetic_score,))
+ if self.do_classifier_free_guidance:
+ add_neg_time_ids = list(original_size + crops_coords_top_left + (negative_aesthetic_score,))
+ else:
+ add_time_ids = list(original_size + crops_coords_top_left + target_size)
+ if self.do_classifier_free_guidance:
+ add_neg_time_ids = list(original_size + crops_coords_top_left + target_size)
+ add_time_ids = torch.tensor([add_time_ids], dtype=dtype, device=self.device)
+ if self.do_classifier_free_guidance:
+ add_neg_time_ids = torch.tensor([add_neg_time_ids], dtype=dtype, device=self.device)
+ add_time_ids = torch.cat([add_neg_time_ids, add_time_ids], dim=0)
+ return add_time_ids
+
+ original_size = (image_height, image_width)
+ crops_coords_top_left = (0, 0)
+ target_size = (image_height, image_width)
+ if self.pipeline_type.is_sd_xl_refiner():
+ add_time_ids = _get_add_time_ids(
+ original_size, crops_coords_top_left, target_size, dtype=text_embeddings.dtype, aesthetic_score=aesthetic_score, negative_aesthetic_score=negative_aesthetic_score
+ )
+ else:
+ add_time_ids = _get_add_time_ids(
+ original_size, crops_coords_top_left, target_size, dtype=text_embeddings.dtype
+ )
+ add_time_ids = add_time_ids.repeat(batch_size, 1)
+ denoise_kwargs.update({'text_embeds': pooled_embeddings2, 'time_ids': add_time_ids})
+ else:
+ text_embeddings = self.encode_prompt(prompt, negative_prompt)
+
+ # UNet denoiser + (optional) ControlNet(s)
+ denoiser = 'unetxl' if self.pipeline_type.is_sd_xl() else 'unet'
+ latents = self.denoise_latent(latents, text_embeddings, denoiser=denoiser, **denoise_kwargs)
+
+ # VAE decode latent (if applicable)
+ if self.return_latents:
+ latents = latents * self.vae_scaling_factor
+ else:
+ images = self.decode_latent(latents)
+
+ torch.cuda.synchronize()
+ e2e_toc = time.perf_counter()
+
+ walltime_ms = (e2e_toc - e2e_tic) * 1000.
+ if not warmup:
+ self.print_summary(num_inference_steps, walltime_ms, batch_size)
+ if not self.return_latents and save_image:
+ self.save_image(images, self.pipeline_type.name.lower(), prompt, self.seed)
+
+ return (latents, walltime_ms) if self.return_latents else (images, walltime_ms)
+
+ def run(self, prompt, negative_prompt, height, width, batch_size, batch_count, num_warmup_runs, use_cuda_graph, **kwargs):
+ # Process prompt
+ if not isinstance(prompt, list):
+ raise ValueError(f"`prompt` must be of type `str` list, but is {type(prompt)}")
+ prompt = prompt * batch_size
+
+ if not isinstance(negative_prompt, list):
+ raise ValueError(f"`--negative-prompt` must be of type `str` list, but is {type(negative_prompt)}")
+ if len(negative_prompt) == 1:
+ negative_prompt = negative_prompt * batch_size
+
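+ # Ensure at least one warmup run when CUDA graphs are enabled (graph capture relies on a prior launch)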
+ num_warmup_runs = max(1, num_warmup_runs) if use_cuda_graph else num_warmup_runs
+ if num_warmup_runs > 0:
+ print("[I] Warming up ..")
+ for _ in range(num_warmup_runs):
+ self.infer(prompt, negative_prompt, height, width, warmup=True, **kwargs)
+
+ for _ in range(batch_count):
+ print("[I] Running StableDiffusion pipeline")
+ if self.nvtx_profile:
+ cudart.cudaProfilerStart()
+ self.infer(prompt, negative_prompt, height, width, warmup=False, **kwargs)
+ if self.nvtx_profile:
+ cudart.cudaProfilerStop()
diff --git a/demo/Diffusion/txt2img_pipeline.py b/demo/Diffusion/txt2img_pipeline.py
deleted file mode 100755
index 7a87cd1c..00000000
--- a/demo/Diffusion/txt2img_pipeline.py
+++ /dev/null
@@ -1,102 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import numpy as np
-import nvtx
-import time
-import torch
-import tensorrt as trt
-from utilities import TRT_LOGGER
-from stable_diffusion_pipeline import StableDiffusionPipeline
-
-class Txt2ImgPipeline(StableDiffusionPipeline):
- """
- Application showcasing the acceleration of Stable Diffusion Txt2Img v1.4, v1.5, v2.0, v2.0-base, v2.1, v2.1-base pipeline using NVidia TensorRT w/ Plugins.
- """
- def __init__(
- self,
- scheduler="DDIM",
- *args, **kwargs
- ):
- """
- Initializes the Txt2Img Diffusion pipeline.
-
- Args:
- scheduler (str):
- The scheduler to guide the denoising process. Must be one of the [DPM, LMSD, DDIM, EulerA, PNDM].
- """
- super(Txt2ImgPipeline, self).__init__(*args, **kwargs, \
- scheduler=scheduler, stages=['clip','unet','vae'])
-
- def infer(
- self,
- prompt,
- negative_prompt,
- image_height,
- image_width,
- seed=None,
- warmup=False,
- verbose=False
- ):
- """
- Run the diffusion pipeline.
-
- Args:
- prompt (str):
- The text prompt to guide image generation.
- negative_prompt (str):
- The prompt not to guide the image generation.
- image_height (int):
- Height (in pixels) of the image to be generated. Must be a multiple of 8.
- image_width (int):
- Width (in pixels) of the image to be generated. Must be a multiple of 8.
- seed (int):
- Seed for the random generator
- warmup (bool):
- Indicate if this is a warmup run.
- verbose (bool):
- Verbose in logging
- """
- assert len(prompt) == len(negative_prompt)
-
- with torch.inference_mode(), torch.autocast("cuda"), trt.Runtime(TRT_LOGGER):
- # Pre-initialize latents
- latents = self.initialize_latents( \
- batch_size=len(prompt), \
- unet_channels=4, \
- latent_height=(image_height // 8), \
- latent_width=(image_width // 8)
- )
-
- torch.cuda.synchronize()
- e2e_tic = time.perf_counter()
-
- # CLIP text encoder
- text_embeddings = self.encode_prompt(prompt, negative_prompt)
-
- # UNet denoiser
- latents = self.denoise_latent(latents, text_embeddings)
-
- # VAE decode latent
- images = self.decode_latent(latents)
-
- torch.cuda.synchronize()
- e2e_toc = time.perf_counter()
-
- if not warmup:
- self.print_summary(self.denoising_steps, e2e_tic, e2e_toc)
- self.save_image(images, 'txt2img', prompt)
diff --git a/demo/Diffusion/utilities.py b/demo/Diffusion/utilities.py
index fad7c4aa..62d582f5 100644
--- a/demo/Diffusion/utilities.py
+++ b/demo/Diffusion/utilities.py
@@ -1,5 +1,4 @@
#
-# Copyright 2022 The HuggingFace Inc. team.
# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
@@ -17,25 +16,36 @@
#
from collections import OrderedDict
-from copy import copy
+from cuda import cudart
+from diffusers.models.lora import LoRACompatibleConv, LoRACompatibleLinear
+from diffusers.utils.torch_utils import randn_tensor
+from enum import Enum, auto
+import gc
+from io import BytesIO
import numpy as np
import onnx
+from onnx import numpy_helper
import onnx_graphsurgeon as gs
import os
-import math
from PIL import Image
from polygraphy.backend.common import bytes_from_path
-from polygraphy.backend.trt import CreateConfig, Profile
-from polygraphy.backend.trt import engine_from_bytes, engine_from_network, network_from_onnx_path, save_engine
-from polygraphy.backend.trt import util as trt_util
-from polygraphy import cuda
+from polygraphy.backend.trt import (
+ CreateConfig,
+ ModifyNetworkOutputs,
+ Profile,
+ engine_from_bytes,
+ engine_from_network,
+ network_from_onnx_path,
+ save_engine
+)
+from polygraphy.logger import G_LOGGER
import random
+import re
+import requests
from scipy import integrate
import tensorrt as trt
import torch
-import requests
-from io import BytesIO
-from cuda import cudart
+import types
TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
@@ -60,6 +70,79 @@
# Map of torch dtype -> numpy dtype
torch_to_numpy_dtype_dict = {value : key for (key, value) in numpy_to_torch_dtype_dict.items()}
+def unload_model(model):
+ if model:
+ del model
+ torch.cuda.empty_cache()
+ gc.collect()
+
+def replace_lora_layers(model):
+ def lora_forward(self, x, scale=None):
+ return self._torch_forward(x)
+
+ for name, module in model.named_modules():
+ if isinstance(module, LoRACompatibleConv):
+ in_channels = module.in_channels
+ out_channels = module.out_channels
+ kernel_size = module.kernel_size
+ stride = module.stride
+ padding = module.padding
+ dilation = module.dilation
+ groups = module.groups
+ bias = module.bias
+
+ new_conv = torch.nn.Conv2d(
+ in_channels,
+ out_channels,
+ kernel_size,
+ stride=stride,
+ padding=padding,
+ dilation=dilation,
+ groups=groups,
+ bias=bias is not None,
+ )
+
+ new_conv.weight.data = module.weight.data.clone().to(module.weight.data.device)
+ if bias is not None:
+ new_conv.bias.data = module.bias.data.clone().to(module.bias.data.device)
+
+ # Replace the LoRACompatibleConv layer with the Conv2d layer
+ path = name.split(".")
+ sub_module = model
+ for p in path[:-1]:
+ sub_module = getattr(sub_module, p)
+ setattr(sub_module, path[-1], new_conv)
+ new_conv._torch_forward = new_conv.forward
+ new_conv.forward = types.MethodType(lora_forward, new_conv)
+
+ elif isinstance(module, LoRACompatibleLinear):
+ in_features = module.in_features
+ out_features = module.out_features
+ bias = module.bias
+
+ new_linear = torch.nn.Linear(in_features, out_features, bias=bias is not None)
+
+ new_linear.weight.data = module.weight.data.clone().to(module.weight.data.device)
+ if bias is not None:
+ new_linear.bias.data = module.bias.data.clone().to(module.bias.data.device)
+
+ # Replace the LoRACompatibleLinear layer with the Linear layer
+ path = name.split(".")
+ sub_module = model
+ for p in path[:-1]:
+ sub_module = getattr(sub_module, p)
+ setattr(sub_module, path[-1], new_linear)
+ new_linear._torch_forward = new_linear.forward
+ new_linear.forward = types.MethodType(lora_forward, new_linear)
+
+def merge_loras(model, lora_dict, lora_alphas, lora_scales):
+ assert len(lora_scales) == len(lora_dict)
+ for path, lora in lora_dict.items():
+ print(f"[I] Fusing LoRA: {path}, scale {lora_scales[path]}")
+ model.load_attn_procs(lora, network_alphas=lora_alphas[path])
+ model.fuse_lora(lora_scale=lora_scales[path])
+ return model
+
def CUASSERT(cuda_ret):
err = cuda_ret[0]
if err != cudart.cudaError_t.cudaSuccess:
@@ -68,6 +151,35 @@ def CUASSERT(cuda_ret):
return cuda_ret[1]
return None
+class PIPELINE_TYPE(Enum):
+ TXT2IMG = auto()
+ IMG2IMG = auto()
+ INPAINT = auto()
+ CONTROLNET = auto()
+ XL_BASE = auto()
+ XL_REFINER = auto()
+
+ def is_txt2img(self):
+ return self == self.TXT2IMG
+
+ def is_img2img(self):
+ return self == self.IMG2IMG
+
+ def is_inpaint(self):
+ return self == self.INPAINT
+
+ def is_controlnet(self):
+ return self == self.CONTROLNET
+
+ def is_sd_xl_base(self):
+ return self == self.XL_BASE
+
+ def is_sd_xl_refiner(self):
+ return self == self.XL_REFINER
+
+ def is_sd_xl(self):
+ return self.is_sd_xl_base() or self.is_sd_xl_refiner()
+
class Engine():
def __init__(
self,
@@ -81,116 +193,55 @@ def __init__(
self.cuda_graph_instance = None # cuda graph
def __del__(self):
- [buf.free() for buf in self.buffers.values() if isinstance(buf, cuda.DeviceArray) ]
del self.engine
del self.context
del self.buffers
del self.tensors
- def refit(self, onnx_path, onnx_refit_path):
- def convert_int64(arr):
- # TODO: smarter conversion
- if len(arr.shape) == 0:
- return np.int32(arr)
- return arr
-
- def add_to_map(refit_dict, name, values):
- if name in refit_dict:
- assert refit_dict[name] is None
- if values.dtype == np.int64:
- values = convert_int64(values)
- refit_dict[name] = values
-
- print(f"Refitting TensorRT engine with {onnx_refit_path} weights")
- refit_nodes = gs.import_onnx(onnx.load(onnx_refit_path)).toposort().nodes
-
- # Construct mapping from weight names in refit model -> original model
- name_map = {}
- for n, node in enumerate(gs.import_onnx(onnx.load(onnx_path)).toposort().nodes):
- refit_node = refit_nodes[n]
- assert node.op == refit_node.op
- # Constant nodes in ONNX do not have inputs but have a constant output
- if node.op == "Constant":
- name_map[refit_node.outputs[0].name] = node.outputs[0].name
- # Handle scale and bias weights
- elif node.op == "Conv":
- if node.inputs[1].__class__ == gs.Constant:
- name_map[refit_node.name+"_TRTKERNEL"] = node.name+"_TRTKERNEL"
- if node.inputs[2].__class__ == gs.Constant:
- name_map[refit_node.name+"_TRTBIAS"] = node.name+"_TRTBIAS"
- # For all other nodes: find node inputs that are initializers (gs.Constant)
- else:
- for i, inp in enumerate(node.inputs):
- if inp.__class__ == gs.Constant:
- name_map[refit_node.inputs[i].name] = inp.name
- def map_name(name):
- if name in name_map:
- return name_map[name]
- return name
-
- # Construct refit dictionary
- refit_dict = {}
+ def refit(self, refit_weights, is_fp16):
+ # Initialize refitter
refitter = trt.Refitter(self.engine, TRT_LOGGER)
- all_weights = refitter.get_all()
- for layer_name, role in zip(all_weights[0], all_weights[1]):
- # for speciailized roles, use a unique name in the map:
- if role == trt.WeightsRole.KERNEL:
- name = layer_name+"_TRTKERNEL"
- elif role == trt.WeightsRole.BIAS:
- name = layer_name+"_TRTBIAS"
- else:
- name = layer_name
-
- assert name not in refit_dict, "Found duplicate layer: " + name
- refit_dict[name] = None
-
-
- for n in refit_nodes:
- # Constant nodes in ONNX do not have inputs but have a constant output
- if n.op == "Constant":
- name = map_name(n.outputs[0].name)
- print(f"Add Constant {name}\n")
- add_to_map(refit_dict, name, n.outputs[0].values)
- # Handle scale and bias weights
- elif n.op == "Conv":
- if n.inputs[1].__class__ == gs.Constant:
- name = map_name(n.name+"_TRTKERNEL")
- add_to_map(refit_dict, name, n.inputs[1].values)
+ refitted_weights = set()
+ # iterate through all tensorrt refittable weights
+ for trt_weight_name in refitter.get_all_weights():
+ if trt_weight_name not in refit_weights:
+ continue
- if n.inputs[2].__class__ == gs.Constant:
- name = map_name(n.name+"_TRTBIAS")
- add_to_map(refit_dict, name, n.inputs[2].values)
+ # get weight from state dict
+ trt_datatype = trt.DataType.FLOAT
+ if is_fp16:
+ refit_weights[trt_weight_name] = refit_weights[trt_weight_name].half()
+ trt_datatype = trt.DataType.HALF
- # For all other nodes: find node inputs that are initializers (AKA gs.Constant)
- else:
- for inp in n.inputs:
- name = map_name(inp.name)
- if inp.__class__ == gs.Constant:
- add_to_map(refit_dict, name, inp.values)
-
- for layer_name, weights_role in zip(all_weights[0], all_weights[1]):
- if weights_role == trt.WeightsRole.KERNEL:
- custom_name = layer_name+"_TRTKERNEL"
- elif weights_role == trt.WeightsRole.BIAS:
- custom_name = layer_name+"_TRTBIAS"
- else:
- custom_name = layer_name
+ # construct trt.Weights and determine the trt.TensorLocation for this tensor
+ trt_wt_tensor = trt.Weights(trt_datatype, refit_weights[trt_weight_name].data_ptr(), torch.numel(refit_weights[trt_weight_name]))
+ trt_wt_location = trt.TensorLocation.DEVICE if refit_weights[trt_weight_name].is_cuda else trt.TensorLocation.HOST
- # Skip refitting Trilu for now; scalar weights of type int64 value 1 - for clip model
- if layer_name.startswith("onnx::Trilu"):
- continue
-
- if refit_dict[custom_name] is not None:
- refitter.set_weights(layer_name, weights_role, refit_dict[custom_name])
- else:
- print(f"[W] No refit weights for layer: {layer_name}")
+ # apply refit
+ refitter.set_named_weights(trt_weight_name, trt_wt_tensor, trt_wt_location)
+ refitted_weights.add(trt_weight_name)
+ assert set(refitted_weights) == set(refit_weights.keys())
if not refitter.refit_cuda_engine():
- print("Failed to refit!")
+ print("Error: failed to refit new weights.")
exit(0)
- def build(self, onnx_path, fp16, input_profile=None, enable_refit=False, enable_preview=False, enable_all_tactics=False, timing_cache=None, workspace_size=0):
+ print(f"[I] Total refitted weights {len(refitted_weights)}.")
+
+ def build(self,
+ onnx_path,
+ fp16=True,
+ tf32=False,
+ int8=False,
+ input_profile=None,
+ enable_refit=False,
+ enable_all_tactics=False,
+ timing_cache=None,
+ update_output_names=None,
+ verbose=False,
+ **extra_build_args
+ ):
print(f"Building TensorRT engine for {onnx_path}: {self.engine_path}")
p = Profile()
if input_profile:
@@ -198,28 +249,27 @@ def build(self, onnx_path, fp16, input_profile=None, enable_refit=False, enable_
assert len(dims) == 3
p.add(name, min=dims[0], opt=dims[1], max=dims[2])
- config_kwargs = {}
-
- config_kwargs['preview_features'] = [trt.PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
- if enable_preview:
- # Faster dynamic shapes made optional since it increases engine build time.
- config_kwargs['preview_features'].append(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)
- if workspace_size > 0:
- config_kwargs['memory_pool_limits'] = {trt.MemoryPoolType.WORKSPACE: workspace_size}
if not enable_all_tactics:
- config_kwargs['tactic_sources'] = []
-
- engine = engine_from_network(
- network_from_onnx_path(onnx_path, flags=[trt.OnnxParserFlag.NATIVE_INSTANCENORM]),
- config=CreateConfig(fp16=fp16,
- refittable=enable_refit,
- profiles=[p],
- load_timing_cache=timing_cache,
- **config_kwargs
- ),
- save_timing_cache=timing_cache
- )
- save_engine(engine, path=self.engine_path)
+ extra_build_args['tactic_sources'] = []
+
+ network = network_from_onnx_path(onnx_path, flags=[trt.OnnxParserFlag.NATIVE_INSTANCENORM])
+ if update_output_names:
+ print(f"Updating network outputs to {update_output_names}")
+ network = ModifyNetworkOutputs(network, update_output_names)
+ with G_LOGGER.verbosity(G_LOGGER.EXTRA_VERBOSE if verbose else G_LOGGER.ERROR):
+ engine = engine_from_network(
+ network,
+ config=CreateConfig(fp16=fp16,
+ tf32=tf32,
+ int8=int8,
+ refittable=enable_refit,
+ profiles=[p],
+ load_timing_cache=timing_cache,
+ **extra_build_args
+ ),
+ save_timing_cache=timing_cache
+ )
+ save_engine(engine, path=self.engine_path)
def load(self):
print(f"Loading TensorRT engine: {self.engine_path}")
@@ -233,19 +283,21 @@ def activate(self, reuse_device_memory=None):
self.context = self.engine.create_execution_context()
def allocate_buffers(self, shape_dict=None, device='cuda'):
- for idx in range(trt_util.get_bindings_per_profile(self.engine)):
- binding = self.engine[idx]
- if shape_dict and binding in shape_dict:
- shape = shape_dict[binding]
+ for binding in range(self.engine.num_io_tensors):
+ name = self.engine.get_tensor_name(binding)
+ if shape_dict and name in shape_dict:
+ shape = shape_dict[name]
else:
- shape = self.engine.get_binding_shape(binding)
- dtype = trt.nptype(self.engine.get_binding_dtype(binding))
- if self.engine.binding_is_input(binding):
- self.context.set_binding_shape(idx, shape)
+ shape = self.engine.get_tensor_shape(name)
+ dtype = trt.nptype(self.engine.get_tensor_dtype(name))
+ if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
+ self.context.set_input_shape(name, shape)
tensor = torch.empty(tuple(shape), dtype=numpy_to_torch_dtype_dict[dtype]).to(device=device)
- self.tensors[binding] = tensor
+ self.tensors[name] = tensor
+
def infer(self, feed_dict, stream, use_cuda_graph=False):
+
for name, buf in feed_dict.items():
self.tensors[name].copy_(buf)
@@ -254,883 +306,25 @@ def infer(self, feed_dict, stream, use_cuda_graph=False):
if use_cuda_graph:
if self.cuda_graph_instance is not None:
- CUASSERT(cudart.cudaGraphLaunch(self.cuda_graph_instance, stream.ptr))
- CUASSERT(cudart.cudaStreamSynchronize(stream.ptr))
+ CUASSERT(cudart.cudaGraphLaunch(self.cuda_graph_instance, stream))
+ CUASSERT(cudart.cudaStreamSynchronize(stream))
else:
# do inference before CUDA graph capture
- noerror = self.context.execute_async_v3(stream.ptr)
+ noerror = self.context.execute_async_v3(stream)
if not noerror:
raise ValueError(f"ERROR: inference failed.")
# capture cuda graph
- CUASSERT(cudart.cudaStreamBeginCapture(stream.ptr, cudart.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal))
- self.context.execute_async_v3(stream.ptr)
- self.graph = CUASSERT(cudart.cudaStreamEndCapture(stream.ptr))
+ CUASSERT(cudart.cudaStreamBeginCapture(stream, cudart.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal))
+ self.context.execute_async_v3(stream)
+ self.graph = CUASSERT(cudart.cudaStreamEndCapture(stream))
self.cuda_graph_instance = CUASSERT(cudart.cudaGraphInstantiate(self.graph, 0))
else:
- noerror = self.context.execute_async_v3(stream.ptr)
+ noerror = self.context.execute_async_v3(stream)
if not noerror:
raise ValueError(f"ERROR: inference failed.")
return self.tensors
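A hedged sketch of driving the updated `Engine.infer()` path, which now takes a raw cudart stream handle rather than a wrapper object with a `.ptr` attribute. The engine path, tensor name, and shapes below are placeholders, and the `Engine` constructor is assumed to take the serialized engine path.

```python
# Hedged sketch: engine path, tensor names, and shapes are placeholders.
import torch
from cuda import cudart
from utilities import Engine, CUASSERT

engine = Engine("engine/unet.trt")                  # hypothetical engine file
engine.load()
engine.activate()
engine.allocate_buffers(shape_dict={"sample": (2, 4, 64, 64)}, device="cuda")

stream = CUASSERT(cudart.cudaStreamCreate())        # raw cudaStream_t handle, no .ptr wrapper
feed_dict = {"sample": torch.randn(2, 4, 64, 64, device="cuda")}
outputs = engine.infer(feed_dict, stream, use_cuda_graph=False)
```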
-
-class LMSDiscreteScheduler():
- def __init__(
- self,
- device = 'cuda',
- beta_start = 0.00085,
- beta_end = 0.012,
- num_train_timesteps = 1000,
- steps_offset = 0,
- prediction_type = 'epsilon'
- ):
- self.num_train_timesteps = num_train_timesteps
- self.order = 4
-
- self.beta_start = beta_start
- self.beta_end = beta_end
- betas = (torch.linspace(beta_start**0.5, beta_end**0.5, self.num_train_timesteps, dtype=torch.float32) ** 2)
- alphas = 1.0 - betas
- self.alphas_cumprod = torch.cumprod(alphas, dim=0)
-
- sigmas = np.array(((1 - self.alphas_cumprod) / self.alphas_cumprod) ** 0.5)
- sigmas = np.concatenate([sigmas[::-1], [0.0]]).astype(np.float32)
- self.sigmas = torch.from_numpy(sigmas)
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = self.sigmas.max()
-
- self.device = device
- self.steps_offset = steps_offset
- self.prediction_type = prediction_type
-
- def set_timesteps(self, steps):
- self.num_inference_steps = steps
-
- timesteps = np.linspace(0, self.num_train_timesteps - 1, steps, dtype=float)[::-1].copy()
- sigmas = np.array(((1 - self.alphas_cumprod) / self.alphas_cumprod) ** 0.5)
- sigmas = np.interp(timesteps, np.arange(0, len(sigmas)), sigmas)
- sigmas = np.concatenate([sigmas, [0.0]]).astype(np.float32)
- self.sigmas = torch.from_numpy(sigmas).to(device=self.device)
-
- # Move all timesteps to correct device beforehand
- self.timesteps = torch.from_numpy(timesteps).to(device=self.device).float()
- self.derivatives = []
-
- def scale_model_input(self, sample: torch.FloatTensor, idx, *args, **kwargs) -> torch.FloatTensor:
- return sample * self.latent_scales[idx]
-
- def configure(self):
- order = self.order
- self.lms_coeffs = []
- self.latent_scales = [1./((sigma**2 + 1) ** 0.5) for sigma in self.sigmas]
-
- def get_lms_coefficient(order, t, current_order):
- """
- Compute a linear multistep coefficient.
- """
- def lms_derivative(tau):
- prod = 1.0
- for k in range(order):
- if current_order == k:
- continue
- prod *= (tau - self.sigmas[t - k]) / (self.sigmas[t - current_order] - self.sigmas[t - k])
- return prod
- integrated_coeff = integrate.quad(lms_derivative, self.sigmas[t], self.sigmas[t + 1], epsrel=1e-4)[0]
- return integrated_coeff
-
- for step_index in range(self.num_inference_steps):
- order = min(step_index + 1, order)
- self.lms_coeffs.append([get_lms_coefficient(order, step_index, curr_order) for curr_order in range(order)])
-
- def step(self, output, latents, idx, timestep):
- # compute the previous noisy sample x_t -> x_t-1
- # 1. compute predicted original sample (x_0) from sigma-scaled predicted noise
- sigma = self.sigmas[idx]
- if self.prediction_type == "epsilon":
- pred_original_sample = latents - sigma * output
- elif self.prediction_type == "v_prediction":
- # * c_out + input * c_skip
- pred_original_sample = output * (-sigma / (sigma**2 + 1) ** 0.5) + (latents / (sigma**2 + 1))
- else:
- raise ValueError(
- f"prediction_type given as {self.config.prediction_type} must be one of `epsilon`, or `v_prediction`"
- )
- # 2. Convert to an ODE derivative
- derivative = (latents - pred_original_sample) / sigma
- self.derivatives.append(derivative)
- if len(self.derivatives) > self.order:
- self.derivatives.pop(0)
- # 3. Compute previous sample based on the derivatives path
- prev_sample = latents + sum(
- coeff * derivative for coeff, derivative in zip(self.lms_coeffs[idx], reversed(self.derivatives))
- )
-
- return prev_sample
-
- def add_noise(self, init_latents, noise, idx, latent_timestep):
- sigma = self.sigmas[idx]
-
- noisy_latents = init_latents + noise * sigma
- return noisy_latents
-
-class DDIMScheduler():
- def __init__(
- self,
- device='cuda',
- num_train_timesteps: int = 1000,
- beta_start: float = 0.0001,
- beta_end: float = 0.02,
- clip_sample: bool = False,
- set_alpha_to_one: bool = False,
- steps_offset: int = 1,
- prediction_type: str = "epsilon",
- ):
- # this schedule is very specific to the latent diffusion model.
- betas = (
- torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2
- )
-
- alphas = 1.0 - betas
- self.alphas_cumprod = torch.cumprod(alphas, dim=0)
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = 1.0
-
- # At every step in ddim, we are looking into the previous alphas_cumprod
- # For the final step, there is no previous alphas_cumprod because we are already at 0
- # `set_alpha_to_one` decides whether we set this parameter simply to one or
- # whether we use the final alpha of the "non-previous" one.
- self.final_alpha_cumprod = torch.tensor(1.0) if set_alpha_to_one else self.alphas_cumprod[0]
-
- # setable values
- self.num_inference_steps = None
- self.timesteps = torch.from_numpy(np.arange(0, num_train_timesteps)[::-1].copy().astype(np.int64))
- self.steps_offset = steps_offset
- self.num_train_timesteps = num_train_timesteps
- self.clip_sample = clip_sample
- self.prediction_type = prediction_type
- self.device = device
-
- def configure(self):
- variance = np.zeros(self.num_inference_steps, dtype=np.float32)
- for idx, timestep in enumerate(self.timesteps):
- prev_timestep = timestep - self.num_train_timesteps // self.num_inference_steps
- variance[idx] = self._get_variance(timestep, prev_timestep)
- self.variance = torch.from_numpy(variance).to(self.device)
-
- timesteps = self.timesteps.long().cpu()
- self.alphas_cumprod = self.alphas_cumprod[timesteps].to(self.device)
- self.final_alpha_cumprod = self.final_alpha_cumprod.to(self.device)
-
- def scale_model_input(self, sample: torch.FloatTensor, idx, *args, **kwargs) -> torch.FloatTensor:
- return sample
-
- def _get_variance(self, timestep, prev_timestep):
- alpha_prod_t = self.alphas_cumprod[timestep]
- alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod
- beta_prod_t = 1 - alpha_prod_t
- beta_prod_t_prev = 1 - alpha_prod_t_prev
-
- variance = (beta_prod_t_prev / beta_prod_t) * (1 - alpha_prod_t / alpha_prod_t_prev)
-
- return variance
-
- def set_timesteps(self, num_inference_steps: int):
- self.num_inference_steps = num_inference_steps
- step_ratio = self.num_train_timesteps // self.num_inference_steps
- # creates integer timesteps by multiplying by ratio
- # casting to int to avoid issues when num_inference_step is power of 3
- timesteps = (np.arange(0, num_inference_steps) * step_ratio).round()[::-1].copy().astype(np.int64)
- self.timesteps = torch.from_numpy(timesteps).to(self.device)
- self.timesteps += self.steps_offset
-
- def step(self, model_output, sample, idx, timestep,
- eta: float = 0.0,
- use_clipped_model_output: bool = False,
- generator=None,
- variance_noise: torch.FloatTensor = None,
- ):
- if self.num_inference_steps is None:
- raise ValueError(
- "Number of inference steps is 'None', you need to run 'set_timesteps' after creating the scheduler"
- )
-
- # See formulas (12) and (16) of DDIM paper https://arxiv.org/pdf/2010.02502.pdf
- # Ideally, read DDIM paper in-detail understanding
-
- # Notation ( ->
- # - pred_noise_t -> e_theta(x_t, t)
- # - pred_original_sample -> f_theta(x_t, t) or x_0
- # - std_dev_t -> sigma_t
- # - eta -> η
- # - pred_sample_direction -> "direction pointing to x_t"
- # - pred_prev_sample -> "x_t-1"
-
- prev_idx = idx + 1
- alpha_prod_t = self.alphas_cumprod[idx]
- alpha_prod_t_prev = self.alphas_cumprod[prev_idx] if prev_idx < self.num_inference_steps else self.final_alpha_cumprod
-
- beta_prod_t = 1 - alpha_prod_t
-
- # 3. compute predicted original sample from predicted noise also called
- # "predicted x_0" of formula (12) from https://arxiv.org/pdf/2010.02502.pdf
- if self.prediction_type == "epsilon":
- pred_original_sample = (sample - beta_prod_t ** (0.5) * model_output) / alpha_prod_t ** (0.5)
- elif self.prediction_type == "sample":
- pred_original_sample = model_output
- elif self.prediction_type == "v_prediction":
- pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
- # predict V
- model_output = (alpha_prod_t**0.5) * model_output + (beta_prod_t**0.5) * sample
- else:
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon`, `sample`, or"
- " `v_prediction`"
- )
-
- # 4. Clip "predicted x_0"
- if self.clip_sample:
- pred_original_sample = torch.clamp(pred_original_sample, -1, 1)
-
- # 5. compute variance: "sigma_t(η)" -> see formula (16)
- # σ_t = sqrt((1 − α_t−1)/(1 − α_t)) * sqrt(1 − α_t/α_t−1)
- variance = self.variance[idx]
- std_dev_t = eta * variance ** (0.5)
-
- if use_clipped_model_output:
- # the model_output is always re-derived from the clipped x_0 in Glide
- model_output = (sample - alpha_prod_t ** (0.5) * pred_original_sample) / beta_prod_t ** (0.5)
-
- # 6. compute "direction pointing to x_t" of formula (12) from https://arxiv.org/pdf/2010.02502.pdf
- pred_sample_direction = (1 - alpha_prod_t_prev - std_dev_t**2) ** (0.5) * model_output
-
- # 7. compute x_t without "random noise" of formula (12) from https://arxiv.org/pdf/2010.02502.pdf
- prev_sample = alpha_prod_t_prev ** (0.5) * pred_original_sample + pred_sample_direction
-
- if eta > 0:
- # randn_like does not support generator https://github.com/pytorch/pytorch/issues/27072
- device = model_output.device
- if variance_noise is not None and generator is not None:
- raise ValueError(
- "Cannot pass both generator and variance_noise. Please make sure that either `generator` or"
- " `variance_noise` stays `None`."
- )
-
- if variance_noise is None:
- variance_noise = torch.randn(
- model_output.shape, generator=generator, device=device, dtype=model_output.dtype
- )
- variance = variance ** (0.5) * eta * variance_noise
-
- prev_sample = prev_sample + variance
-
- return prev_sample
-
- def add_noise(self, init_latents, noise, idx, latent_timestep):
- sqrt_alpha_prod = self.alphas_cumprod[idx] ** 0.5
- sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[idx]) ** 0.5
- noisy_latents = sqrt_alpha_prod * init_latents + sqrt_one_minus_alpha_prod * noise
-
- return noisy_latents
-
-
-class EulerAncestralDiscreteScheduler():
- def __init__(
- self,
- num_train_timesteps: int = 1000,
- beta_start: float = 0.0001,
- beta_end: float = 0.02,
- device = 'cuda',
- steps_offset = 0,
- prediction_type = "epsilon"
- ):
- # this schedule is very specific to the latent diffusion model.
- betas = (
- torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2
- )
-
- alphas = 1.0 - betas
- self.alphas_cumprod = torch.cumprod(alphas, dim=0)
-
- sigmas = np.array(((1 - self.alphas_cumprod) / self.alphas_cumprod) ** 0.5)
- sigmas = np.concatenate([sigmas[::-1], [0.0]]).astype(np.float32)
- self.sigmas = torch.from_numpy(sigmas)
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = self.sigmas.max()
-
- # setable values
- self.num_inference_steps = None
- timesteps = np.linspace(0, num_train_timesteps - 1, num_train_timesteps, dtype=float)[::-1].copy()
- self.timesteps = torch.from_numpy(timesteps)
- self.is_scale_input_called = False
- self.device = device
- self.num_train_timesteps = num_train_timesteps
- self.steps_offset = steps_offset
- self.prediction_type = prediction_type
-
- def scale_model_input(
- self, sample: torch.FloatTensor, idx, timestep, *args, **kwargs
- ) -> torch.FloatTensor:
- if isinstance(timestep, torch.Tensor):
- timestep = timestep.to(self.timesteps.device)
- step_index = (self.timesteps == timestep).nonzero().item()
- sigma = self.sigmas[step_index]
- sample = sample / ((sigma**2 + 1) ** 0.5)
- self.is_scale_input_called = True
- return sample
-
- def set_timesteps(self, num_inference_steps: int):
- self.num_inference_steps = num_inference_steps
-
- timesteps = np.linspace(0, self.num_train_timesteps - 1, num_inference_steps, dtype=np.float32)[::-1].copy()
- sigmas = np.array(((1 - self.alphas_cumprod) / self.alphas_cumprod) ** 0.5)
- sigmas = np.interp(timesteps, np.arange(0, len(sigmas)), sigmas)
- sigmas = np.concatenate([sigmas, [0.0]]).astype(np.float32)
- self.sigmas = torch.from_numpy(sigmas).to(device=self.device)
- self.timesteps = torch.from_numpy(timesteps).to(device=self.device)
-
- def configure(self):
- dts = np.zeros(self.num_inference_steps, dtype=np.float32)
- sigmas_up = np.zeros(self.num_inference_steps, dtype=np.float32)
- for idx, timestep in enumerate(self.timesteps):
- step_index = (self.timesteps == timestep).nonzero().item()
- sigma = self.sigmas[step_index]
-
- sigma_from = self.sigmas[step_index]
- sigma_to = self.sigmas[step_index + 1]
- sigma_up = (sigma_to**2 * (sigma_from**2 - sigma_to**2) / sigma_from**2) ** 0.5
- sigma_down = (sigma_to**2 - sigma_up**2) ** 0.5
- dt = sigma_down - sigma
- dts[idx] = dt
- sigmas_up[idx] = sigma_up
-
- self.dts = torch.from_numpy(dts).to(self.device)
- self.sigmas_up = torch.from_numpy(sigmas_up).to(self.device)
-
- def step(
- self, model_output, sample, idx, timestep,
- generator = None,
- ):
- step_index = (self.timesteps == timestep).nonzero().item()
- sigma = self.sigmas[step_index]
-
- # 1. compute predicted original sample (x_0) from sigma-scaled predicted noise
- if self.prediction_type == "epsilon":
- pred_original_sample = sample - sigma * model_output
- elif self.prediction_type == "v_prediction":
- # * c_out + input * c_skip
- pred_original_sample = model_output * (-sigma / (sigma**2 + 1) ** 0.5) + (sample / (sigma**2 + 1))
- else:
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon`, or `v_prediction`"
- )
-
- sigma_up = self.sigmas_up[idx]
-
- # 2. Convert to an ODE derivative
- derivative = (sample - pred_original_sample) / sigma
-
- dt = self.dts[idx]
-
- prev_sample = sample + derivative * dt
-
- device = model_output.device
- noise = torch.randn(model_output.shape, dtype=model_output.dtype, device=device, generator=generator).to(
- device
- )
-
- prev_sample = prev_sample + noise * sigma_up
-
- return prev_sample
-
- def add_noise(
- self, original_samples, noise, idx, timestep=None):
- step_index = (self.timesteps == timestep).nonzero().item()
- noisy_samples = original_samples + noise * self.sigmas[step_index]
- return noisy_samples
-
-
-class DPMScheduler():
- def __init__(
- self,
- beta_start = 0.00085,
- beta_end = 0.012,
- num_train_timesteps = 1000,
- solver_order = 2,
- predict_epsilon = True,
- thresholding = False,
- dynamic_thresholding_ratio = 0.995,
- sample_max_value = 1.0,
- algorithm_type = "dpmsolver++",
- solver_type = "midpoint",
- lower_order_final = True,
- device = 'cuda',
- steps_offset = 0,
- prediction_type = 'epsilon'
- ):
- # this schedule is very specific to the latent diffusion model.
- self.betas = (
- torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2
- )
-
- self.device = device
- self.alphas = 1.0 - self.betas
- self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
- # Currently we only support VP-type noise schedule
- self.alpha_t = torch.sqrt(self.alphas_cumprod)
- self.sigma_t = torch.sqrt(1 - self.alphas_cumprod)
- self.lambda_t = torch.log(self.alpha_t) - torch.log(self.sigma_t)
- self.steps_offset = steps_offset
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = 1.0
-
- self.algorithm_type = algorithm_type
- self.predict_epsilon = predict_epsilon
- self.thresholding = thresholding
- self.dynamic_thresholding_ratio = dynamic_thresholding_ratio
- self.sample_max_value = sample_max_value
- self.lower_order_final = lower_order_final
- self.prediction_type = prediction_type
-
- # settings for DPM-Solver
- if algorithm_type not in ["dpmsolver", "dpmsolver++"]:
- raise NotImplementedError(f"{algorithm_type} does is not implemented for {self.__class__}")
- if solver_type not in ["midpoint", "heun"]:
- raise NotImplementedError(f"{solver_type} does is not implemented for {self.__class__}")
-
- # setable values
- self.num_inference_steps = None
- self.solver_order = solver_order
- self.num_train_timesteps = num_train_timesteps
- self.solver_type = solver_type
-
- self.first_order_first_coef = []
- self.first_order_second_coef = []
-
- self.second_order_first_coef = []
- self.second_order_second_coef = []
- self.second_order_third_coef = []
-
- self.third_order_first_coef = []
- self.third_order_second_coef = []
- self.third_order_third_coef = []
- self.third_order_fourth_coef = []
-
- def scale_model_input(self, sample: torch.FloatTensor, *args, **kwargs) -> torch.FloatTensor:
- return sample
-
- def configure(self):
- lower_order_nums = 0
- for step_index in range(self.num_inference_steps):
- step_idx = step_index
- timestep = self.timesteps[step_idx]
-
- prev_timestep = 0 if step_idx == len(self.timesteps) - 1 else self.timesteps[step_idx + 1]
-
- self.dpm_solver_first_order_coefs_precompute(timestep, prev_timestep)
-
- timestep_list = [self.timesteps[step_index - 1], timestep]
- self.multistep_dpm_solver_second_order_coefs_precompute(timestep_list, prev_timestep)
-
- timestep_list = [self.timesteps[step_index - 2], self.timesteps[step_index - 1], timestep]
- self.multistep_dpm_solver_third_order_coefs_precompute(timestep_list, prev_timestep)
-
- if lower_order_nums < self.solver_order:
- lower_order_nums += 1
-
- def dpm_solver_first_order_coefs_precompute(self, timestep, prev_timestep):
- lambda_t, lambda_s = self.lambda_t[prev_timestep], self.lambda_t[timestep]
- alpha_t, alpha_s = self.alpha_t[prev_timestep], self.alpha_t[timestep]
- sigma_t, sigma_s = self.sigma_t[prev_timestep], self.sigma_t[timestep]
- h = lambda_t - lambda_s
- if self.algorithm_type == "dpmsolver++":
- self.first_order_first_coef.append(sigma_t / sigma_s)
- self.first_order_second_coef.append(alpha_t * (torch.exp(-h) - 1.0))
- elif self.algorithm_type == "dpmsolver":
- self.first_order_first_coef.append(alpha_t / alpha_s)
- self.first_order_second_coef.append(sigma_t * (torch.exp(h) - 1.0))
-
- def multistep_dpm_solver_second_order_coefs_precompute(self, timestep_list, prev_timestep):
- t, s0, s1 = prev_timestep, timestep_list[-1], timestep_list[-2]
- lambda_t, lambda_s0, lambda_s1 = self.lambda_t[t], self.lambda_t[s0], self.lambda_t[s1]
- alpha_t, alpha_s0 = self.alpha_t[t], self.alpha_t[s0]
- sigma_t, sigma_s0 = self.sigma_t[t], self.sigma_t[s0]
- h = lambda_t - lambda_s0
- if self.algorithm_type == "dpmsolver++":
- # See https://arxiv.org/abs/2211.01095 for detailed derivations
- if self.solver_type == "midpoint":
- self.second_order_first_coef.append(sigma_t / sigma_s0)
- self.second_order_second_coef.append((alpha_t * (torch.exp(-h) - 1.0)))
- self.second_order_third_coef.append(0.5 * (alpha_t * (torch.exp(-h) - 1.0)))
- elif self.solver_type == "heun":
- self.second_order_first_coef.append(sigma_t / sigma_s0)
- self.second_order_second_coef.append((alpha_t * (torch.exp(-h) - 1.0)))
- self.second_order_third_coef.append(alpha_t * ((torch.exp(-h) - 1.0) / h + 1.0))
- elif self.algorithm_type == "dpmsolver":
- # See https://arxiv.org/abs/2206.00927 for detailed derivations
- if self.solver_type == "midpoint":
- self.second_order_first_coef.append(alpha_t / alpha_s0)
- self.second_order_second_coef.append((sigma_t * (torch.exp(h) - 1.0)))
- self.second_order_third_coef.append(0.5 * (sigma_t * (torch.exp(h) - 1.0)))
- elif self.solver_type == "heun":
- self.second_order_first_coef.append(alpha_t / alpha_s0)
- self.second_order_second_coef.append((sigma_t * (torch.exp(h) - 1.0)))
- self.second_order_third_coef.append((sigma_t * ((torch.exp(h) - 1.0) / h - 1.0)))
-
- def multistep_dpm_solver_third_order_coefs_precompute(self, timestep_list, prev_timestep):
- t, s0 = prev_timestep, timestep_list[-1]
- lambda_t, lambda_s0 = (
- self.lambda_t[t],
- self.lambda_t[s0]
- )
- alpha_t, alpha_s0 = self.alpha_t[t], self.alpha_t[s0]
- sigma_t, sigma_s0 = self.sigma_t[t], self.sigma_t[s0]
- h = lambda_t - lambda_s0
- if self.algorithm_type == "dpmsolver++":
- self.third_order_first_coef.append(sigma_t / sigma_s0)
- self.third_order_second_coef.append(alpha_t * (torch.exp(-h) - 1.0))
- self.third_order_third_coef.append(alpha_t * ((torch.exp(-h) - 1.0) / h + 1.0))
- self.third_order_fourth_coef.append(alpha_t * ((torch.exp(-h) - 1.0 + h) / h**2 - 0.5))
- elif self.algorithm_type == "dpmsolver":
- self.third_order_first_coef.append(alpha_t / alpha_s0)
- self.third_order_second_coef.append(sigma_t * (torch.exp(h) - 1.0))
- self.third_order_third_coef.append(sigma_t * ((torch.exp(h) - 1.0) / h - 1.0))
- self.third_order_fourth_coef.append(sigma_t * ((torch.exp(h) - 1.0 - h) / h**2 - 0.5))
-
- def set_timesteps(self, num_inference_steps):
- self.num_inference_steps = num_inference_steps
- timesteps = (
- np.linspace(0, self.num_train_timesteps - 1, num_inference_steps + 1)
- .round()[::-1][:-1]
- .copy()
- .astype(np.int32)
- )
- self.timesteps = torch.from_numpy(timesteps).to(self.device)
- self.model_outputs = [
- None,
- ] * self.solver_order
- self.lower_order_nums = 0
-
- def convert_model_output(
- self, model_output, timestep, sample
- ):
- # DPM-Solver++ needs to solve an integral of the data prediction model.
- if self.algorithm_type == "dpmsolver++":
- if self.prediction_type == "epsilon":
- alpha_t, sigma_t = self.alpha_t[timestep], self.sigma_t[timestep]
- x0_pred = (sample - sigma_t * model_output) / alpha_t
- elif self.prediction_type == "v_prediction":
- alpha_t, sigma_t = self.alpha_t[timestep], self.sigma_t[timestep]
- x0_pred = alpha_t * sample - sigma_t * model_output
- else:
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon`, or"
- " `v_prediction` for the DPMScheduler."
- )
-
- if self.thresholding:
- # Dynamic thresholding in https://arxiv.org/abs/2205.11487
- dynamic_max_val = torch.quantile(
- torch.abs(x0_pred).reshape((x0_pred.shape[0], -1)), self.dynamic_thresholding_ratio, dim=1
- )
- dynamic_max_val = torch.maximum(
- dynamic_max_val,
- self.sample_max_value * torch.ones_like(dynamic_max_val).to(dynamic_max_val.device),
- )[(...,) + (None,) * (x0_pred.ndim - 1)]
- x0_pred = torch.clamp(x0_pred, -dynamic_max_val, dynamic_max_val) / dynamic_max_val
- return x0_pred
- # DPM-Solver needs to solve an integral of the noise prediction model.
- elif self.algorithm_type == "dpmsolver":
- if self.prediction_type == "epsilon":
- return model_output
- elif self.prediction_type == "v_prediction":
- alpha_t, sigma_t = self.alpha_t[timestep], self.sigma_t[timestep]
- epsilon = alpha_t * model_output + sigma_t * sample
- return epsilon
- else:
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon` or"
- " `v_prediction` for the DPMScheduler."
- )
-
- def dpm_solver_first_order_update(
- self,
- idx,
- model_output,
- sample
- ):
- first_coef = self.first_order_first_coef[idx]
- second_coef = self.first_order_second_coef[idx]
-
- if self.algorithm_type == "dpmsolver++":
- x_t = first_coef * sample - second_coef * model_output
- elif self.algorithm_type == "dpmsolver":
- x_t = first_coef * sample - second_coef * model_output
- return x_t
-
- def multistep_dpm_solver_second_order_update(
- self,
- idx,
- model_output_list,
- timestep_list,
- prev_timestep,
- sample
- ):
- t, s0, s1 = prev_timestep, timestep_list[-1], timestep_list[-2]
- m0, m1 = model_output_list[-1], model_output_list[-2]
- lambda_t, lambda_s0, lambda_s1 = self.lambda_t[t], self.lambda_t[s0], self.lambda_t[s1]
- h, h_0 = lambda_t - lambda_s0, lambda_s0 - lambda_s1
- r0 = h_0 / h
- D0, D1 = m0, (1.0 / r0) * (m0 - m1)
-
- first_coef = self.second_order_first_coef[idx]
- second_coef = self.second_order_second_coef[idx]
- third_coef = self.second_order_third_coef[idx]
-
- if self.algorithm_type == "dpmsolver++":
- # See https://arxiv.org/abs/2211.01095 for detailed derivations
- if self.solver_type == "midpoint":
- x_t = (
- first_coef * sample
- - second_coef * D0
- - third_coef * D1
- )
- elif self.solver_type == "heun":
- x_t = (
- first_coef * sample
- - second_coef * D0
- + third_coef * D1
- )
- elif self.algorithm_type == "dpmsolver":
- # See https://arxiv.org/abs/2206.00927 for detailed derivations
- if self.solver_type == "midpoint":
- x_t = (
- first_coef * sample
- - second_coef * D0
- - third_coef * D1
- )
- elif self.solver_type == "heun":
- x_t = (
- first_coef * sample
- - second_coef * D0
- - third_coef * D1
- )
- return x_t
-
- def multistep_dpm_solver_third_order_update(
- self,
- idx,
- model_output_list,
- timestep_list,
- prev_timestep,
- sample
- ):
- t, s0, s1, s2 = prev_timestep, timestep_list[-1], timestep_list[-2], timestep_list[-3]
- m0, m1, m2 = model_output_list[-1], model_output_list[-2], model_output_list[-3]
- lambda_t, lambda_s0, lambda_s1, lambda_s2 = (
- self.lambda_t[t],
- self.lambda_t[s0],
- self.lambda_t[s1],
- self.lambda_t[s2],
- )
- h, h_0, h_1 = lambda_t - lambda_s0, lambda_s0 - lambda_s1, lambda_s1 - lambda_s2
- r0, r1 = h_0 / h, h_1 / h
- D0 = m0
- D1_0, D1_1 = (1.0 / r0) * (m0 - m1), (1.0 / r1) * (m1 - m2)
- D1 = D1_0 + (r0 / (r0 + r1)) * (D1_0 - D1_1)
- D2 = (1.0 / (r0 + r1)) * (D1_0 - D1_1)
-
- first_coef = self.third_order_first_coef[idx]
- second_coef = self.third_order_second_coef[idx]
- third_coef = self.third_order_third_coef[idx]
- fourth_coef = self.third_order_fourth_coef[idx]
-
- if self.algorithm_type == "dpmsolver++":
- # See https://arxiv.org/abs/2206.00927 for detailed derivations
- x_t = (
- first_coef * sample
- - second_coef * D0
- + third_coef * D1
- - fourth_coef * D2
- )
- elif self.algorithm_type == "dpmsolver":
- # See https://arxiv.org/abs/2206.00927 for detailed derivations
- x_t = (
- first_coef * sample
- - second_coef * D0
- - third_coef * D1
- - fourth_coef * D2
- )
- return x_t
-
- def step(self, output, latents, step_index, timestep):
- if self.num_inference_steps is None:
- raise ValueError(
- "Number of inference steps is 'None', you need to run 'set_timesteps' after creating the scheduler"
- )
-
- prev_timestep = 0 if step_index == len(self.timesteps) - 1 else self.timesteps[step_index + 1]
- lower_order_final = (
- (step_index == len(self.timesteps) - 1) and self.lower_order_final and len(self.timesteps) < 15
- )
- lower_order_second = (
- (step_index == len(self.timesteps) - 2) and self.lower_order_final and len(self.timesteps) < 15
- )
-
- output = self.convert_model_output(output, timestep, latents)
- for i in range(self.solver_order - 1):
- self.model_outputs[i] = self.model_outputs[i + 1]
- self.model_outputs[-1] = output
-
- if self.solver_order == 1 or self.lower_order_nums < 1 or lower_order_final:
- prev_sample = self.dpm_solver_first_order_update(step_index, output, latents)
- elif self.solver_order == 2 or self.lower_order_nums < 2 or lower_order_second:
- timestep_list = [self.timesteps[step_index - 1], timestep]
- prev_sample = self.multistep_dpm_solver_second_order_update(
- step_index, self.model_outputs, timestep_list, prev_timestep, latents
- )
- else:
- timestep_list = [self.timesteps[step_index - 2], self.timesteps[step_index - 1], timestep]
- prev_sample = self.multistep_dpm_solver_third_order_update(
- step_index, self.model_outputs, timestep_list, prev_timestep, latents
- )
-
- if self.lower_order_nums < self.solver_order:
- self.lower_order_nums += 1
-
- return prev_sample
-
- def add_noise(self, init_latents, noise, idx, latent_timestep):
- self.alphas_cumprod = self.alphas_cumprod.to(device=init_latents.device, dtype=init_latents.dtype)
- timestep = latent_timestep.to(init_latents.device).long()
-
- sqrt_alpha_prod = self.alphas_cumprod[timestep] ** 0.5
- sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[timestep]) ** 0.5
- noisy_latents = sqrt_alpha_prod * init_latents + sqrt_one_minus_alpha_prod * noise
-
- return noisy_latents
-
-
-class PNDMScheduler():
- def __init__(
- self,
- device = 'cuda',
- beta_start = 0.00085,
- beta_end = 0.012,
- num_train_timesteps = 1000,
- steps_offset: int = 0,
- prediction_type = 'epsilon'
- ):
- self.device = device
- self.num_train_timesteps = num_train_timesteps
- self.pndm_order = 4
-
- self.beta_start = beta_start
- self.beta_end = beta_end
- betas = (torch.linspace(beta_start**0.5, beta_end**0.5, self.num_train_timesteps, dtype=torch.float32) ** 2)
- alphas = 1.0 - betas
- self.alphas_cumprod = torch.cumprod(alphas, dim=0).to(device=self.device)
- self.final_alpha_cumprod = self.alphas_cumprod[0]
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = 1.0
- self.steps_offset = steps_offset
-
- # running values
- self.counter = 0
- self.cur_sample = None
- self.ets = []
- self.prediction_type = prediction_type
-
- def set_timesteps(self, steps):
- self.num_inference_steps = steps
-
- self.step_ratio = self.num_train_timesteps // self.num_inference_steps
- # creates integer timesteps by multiplying by ratio
- timesteps = (np.arange(0, self.num_inference_steps) * self.step_ratio).round()
- timesteps += self.steps_offset
-
- # for some models like stable diffusion the prk steps can/should be skipped to produce better results
- plms_timesteps = np.concatenate([timesteps[:-1], timesteps[-2:-1], timesteps[-1:]])[::-1].copy()
- self.timesteps = torch.from_numpy(plms_timesteps).to(self.device)
-
- # reset running values
- self.counter = 0
- self.cur_sample = None
- self.ets = []
-
- def scale_model_input(self, sample: torch.FloatTensor, idx, *args, **kwargs) -> torch.FloatTensor:
- return sample
-
- def configure(self):
- self.alphas_cumprod_prev = torch.roll(self.alphas_cumprod, shifts=self.step_ratio)
- self.alphas_cumprod_prev[:self.step_ratio] = self.final_alpha_cumprod
- self.sample_coeff = (self.alphas_cumprod_prev / self.alphas_cumprod) ** (0.5)
-
- self.beta_cumprod = 1 - self.alphas_cumprod
- self.beta_cumprod_prev = 1 - self.alphas_cumprod_prev
- self.model_output_denom_coeff = self.alphas_cumprod * (self.beta_cumprod_prev) ** (0.5) + (
- self.alphas_cumprod * self.beta_cumprod * self.alphas_cumprod_prev) ** (0.5)
-
- timesteps = self.timesteps.cpu().long()
-
- self.alphas_cumprod = self.alphas_cumprod[timesteps]
- self.beta_cumprod = self.beta_cumprod[timesteps]
- self.alphas_cumprod_prev = self.alphas_cumprod_prev[timesteps]
- self.sample_coeff = self.sample_coeff[timesteps]
- self.model_output_denom_coeff = self.model_output_denom_coeff[timesteps]
-
- def step(self, output, sample, idx, timestep):
- # step_plms: propagate the sample with the linear multi-step method. This has one forward pass with multiple
- # times to approximate the solution.
-
- # prev_timestep = timestep - self.step_ratio
-
- if self.counter != 1:
- self.ets = self.ets[-3:]
- self.ets.append(output)
- # else:
- # prev_timestep = timestep
- # timestep = timestep + self.step_ratio
-
- if len(self.ets) == 1 and self.counter == 0:
- output = output
- self.cur_sample = sample
- elif len(self.ets) == 1 and self.counter == 1:
- output = (output + self.ets[-1]) / 2
- sample = self.cur_sample
- self.cur_sample = None
- elif len(self.ets) == 2:
- output = (3 * self.ets[-1] - self.ets[-2]) / 2
- elif len(self.ets) == 3:
- output = (23 * self.ets[-1] - 16 * self.ets[-2] + 5 * self.ets[-3]) / 12
- else:
- output = (1 / 24) * (55 * self.ets[-1] - 59 * self.ets[-2] + 37 * self.ets[-3] - 9 * self.ets[-4])
-
- if self.prediction_type == "v_prediction":
- output = (self.alphas_cumprod[idx]**0.5) * output + (self.beta_cumprod[idx]**0.5) * sample
- elif self.prediction_type != "epsilon":
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon` or `v_prediction`"
- )
-
- prev_sample = (
- self.sample_coeff[idx] * sample - (self.alphas_cumprod_prev[idx] - self.alphas_cumprod[idx]) * output / self.model_output_denom_coeff[idx]
- )
- self.counter += 1
-
- return prev_sample
-
- def add_noise(self, init_latents, noise, idx, latent_timestep):
- sqrt_alpha_prod = self.alphas_cumprod[idx] ** 0.5
- sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[idx]) ** 0.5
- noisy_latents = sqrt_alpha_prod * init_latents + sqrt_one_minus_alpha_prod * noise
-
- return noisy_latents
-
def save_image(images, image_path_dir, image_name_prefix):
"""
Save the generated images to png files.
@@ -1178,42 +372,206 @@ def download_image(url):
response = requests.get(url)
return Image.open(BytesIO(response.content)).convert("RGB")
+def get_refit_weights(state_dict, onnx_opt_path, weight_name_mapping, weight_shape_mapping):
+ onnx_opt_dir = os.path.dirname(onnx_opt_path)
+ onnx_opt_model = onnx.load(onnx_opt_path)
+ # Create initializer data hashes
+ initializer_hash_mapping = {}
+ for initializer in onnx_opt_model.graph.initializer:
+ initializer_data = numpy_helper.to_array(initializer, base_dir=onnx_opt_dir).astype(np.float16)
+ initializer_hash = hash(initializer_data.data.tobytes())
+ initializer_hash_mapping[initializer.name] = initializer_hash
+
+ refit_weights = OrderedDict()
+ for wt_name, wt in state_dict.items():
+ # query initializer to compare
+ initializer_name = weight_name_mapping[wt_name]
+ initializer_hash = initializer_hash_mapping[initializer_name]
+
+ # get shape transform info
+ initializer_shape, is_transpose = weight_shape_mapping[wt_name]
+ if is_transpose:
+ wt = torch.transpose(wt, 0, 1)
+ else:
+ wt = torch.reshape(wt, initializer_shape)
+
+ # include weight if hashes differ
+ wt_hash = hash(wt.cpu().detach().numpy().astype(np.float16).data.tobytes())
+ if initializer_hash != wt_hash:
+ refit_weights[initializer_name] = wt.contiguous()
+ return refit_weights
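A hedged sketch of how `get_refit_weights()` is meant to feed `Engine.refit()` after LoRA fusion. The mapping dictionaries and the ONNX path are assumptions captured at export time and are not defined in this patch.

```python
# Hedged sketch: weight_name_mapping / weight_shape_mapping and the ONNX path are placeholders.
unet = merge_loras(unet, lora_dict, lora_alphas, lora_scales)   # fuse LoRA weights into the torch model

refit_weights = get_refit_weights(
    state_dict=unet.state_dict(),
    onnx_opt_path="onnx/unet.opt/model.onnx",                   # hypothetical optimized ONNX path
    weight_name_mapping=weight_name_mapping,                    # torch weight name -> ONNX initializer name
    weight_shape_mapping=weight_shape_mapping,                  # torch weight name -> (initializer shape, is_transpose)
)
engine.refit(refit_weights, is_fp16=True)                       # casts weights to FP16 before set_named_weights
```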
+
+def load_calib_prompts(batch_size, calib_data_path):
+ with open(calib_data_path, "r") as file:
+ lst = [line.rstrip("\n") for line in file]
+ return [lst[i : i + batch_size] for i in range(0, len(lst), batch_size)]
+
+def filter_func(name):
+ pattern = re.compile(
+ r".*(time_emb_proj|time_embedding|conv_in|conv_out|conv_shortcut|add_embedding).*"
+ )
+ return pattern.match(name) is not None
+
+def quantize_lvl(unet, quant_level=2.5):
+ """
+ We should disable the unwanted quantizer when exporting the onnx
+ Because in the current ammo setting, it will load the quantizer amax for all the layers even
+ if we didn't add that unwanted layer into the config during the calibration
+ """
+ for name, module in unet.named_modules():
+ if isinstance(module, torch.nn.Conv2d):
+ module.input_quantizer.enable()
+ module.weight_quantizer.enable()
+ elif isinstance(module, torch.nn.Linear):
+ if (
+ (quant_level >= 2 and "ff.net" in name)
+ or (quant_level >= 2.5 and ("to_q" in name or "to_k" in name or "to_v" in name))
+ or quant_level == 3
+ ):
+ module.input_quantizer.enable()
+ module.weight_quantizer.enable()
+ else:
+ module.input_quantizer.disable()
+ module.weight_quantizer.disable()
+
+def get_smoothquant_config(model, quant_level=3):
+ quant_config = {
+ "quant_cfg": {},
+ "algorithm": "smoothquant",
+ }
+ for name, module in model.named_modules():
+ w_name = f"{name}*weight_quantizer"
+ i_name = f"{name}*input_quantizer"
+
+ if (
+ w_name in quant_config["quant_cfg"].keys() # type: ignore
+ or i_name in quant_config["quant_cfg"].keys() # type: ignore
+ ):
+ continue
+ if filter_func(name):
+ continue
+ if isinstance(module, torch.nn.Linear):
+ if (
+ (quant_level >= 2 and "ff.net" in name)
+ or (quant_level >= 2.5 and ("to_q" in name or "to_k" in name or "to_v" in name))
+ or quant_level == 3
+ ):
+ quant_config["quant_cfg"][w_name] = {"num_bits": 8, "axis": 0} # type: ignore
+ quant_config["quant_cfg"][i_name] = {"num_bits": 8, "axis": -1} # type: ignore
+ elif isinstance(module, torch.nn.Conv2d):
+ quant_config["quant_cfg"][w_name] = {"num_bits": 8, "axis": 0} # type: ignore
+ quant_config["quant_cfg"][i_name] = {"num_bits": 8, "axis": None} # type: ignore
+ return quant_config
+
+class PercentileAmaxes:
+ def __init__(self, total_step, percentile) -> None:
+ self.data = {}
+ self.total_step = total_step
+ self.percentile = percentile
+ self.i = 0
+
+ def append(self, item):
+ _cur_step = self.i % self.total_step
+ if _cur_step not in self.data.keys():
+ self.data[_cur_step] = item
+ else:
+ self.data[_cur_step] = np.maximum(self.data[_cur_step], item)
+ self.i += 1
+
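A hedged sketch of how the int8 calibration helpers above are intended to compose. The calibration prompt file is a placeholder, and the actual AMMO SmoothQuant calibration call is intentionally omitted because its API is not part of this patch; only the helpers defined above are used.

```python
# Hedged sketch: the AMMO calibration step itself is omitted; only helpers defined above are used.
calib_prompts = load_calib_prompts(batch_size=2, calib_data_path="calib_prompts.txt")  # hypothetical file
quant_config = get_smoothquant_config(unet, quant_level=3.0)   # per-layer int8 config; filter_func matches are skipped
# ... run AMMO SmoothQuant calibration on `unet` with `quant_config` and `calib_prompts` here ...
quantize_lvl(unet, quant_level=3.0)                            # keep only the quantizers wanted at this level before ONNX export
```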
def add_arguments(parser):
# Stable Diffusion configuration
- parser.add_argument('--version', type=str, default="2.1", choices=["1.4", "1.5", "2.0", "2.0-base", "2.1", "2.1-base"], help="Version of Stable Diffusion")
+ parser.add_argument('--version', type=str, default="1.5", choices=["1.4", "1.5", "dreamshaper-7", "2.0-base", "2.0", "2.1-base", "2.1", "xl-1.0", "xl-turbo"], help="Version of Stable Diffusion")
parser.add_argument('prompt', nargs = '*', help="Text prompt(s) to guide image generation")
parser.add_argument('--negative-prompt', nargs = '*', default=[''], help="The negative prompt(s) to guide the image generation.")
- parser.add_argument('--repeat-prompt', type=int, default=1, choices=[1, 2, 4, 8, 16], help="Number of times to repeat the prompt (batch size multiplier)")
+ parser.add_argument('--batch-size', type=int, default=1, choices=[1, 2, 4], help="Batch size (repeat prompt)")
+ parser.add_argument('--batch-count', type=int, default=1, help="Number of images to generate in sequence, one at a time.")
parser.add_argument('--height', type=int, default=512, help="Height of image to generate (must be multiple of 8)")
parser.add_argument('--width', type=int, default=512, help="Width of image to generate (must be multiple of 8)")
- parser.add_argument('--denoising-steps', type=int, default=50, help="Number of denoising steps")
+ parser.add_argument('--denoising-steps', type=int, default=30, help="Number of denoising steps")
+ parser.add_argument('--scheduler', type=str, default=None, choices=["DDIM", "DDPM", "EulerA", "Euler", "LCM", "LMSD", "PNDM", "UniPC"], help="Scheduler for diffusion process")
+ parser.add_argument('--guidance-scale', type=float, default=7.5, help="Value of classifier-free guidance scale (must be greater than 1)")
+ parser.add_argument('--lora-scale', type=float, nargs='+', default=None, help="Scale of LoRA weights, default 1 (must be between 0 and 1)")
+ parser.add_argument('--lora-path', type=str, nargs='+', default=None, help="Path to LoRA adaptor. Ex: 'latent-consistency/lcm-lora-sdv1-5'")
# ONNX export
- parser.add_argument('--onnx-opset', type=int, default=17, choices=range(7,18), help="Select ONNX opset version to target for exported models")
+ parser.add_argument('--onnx-opset', type=int, default=19, choices=range(7,20), help="Select ONNX opset version to target for exported models")
parser.add_argument('--onnx-dir', default='onnx', help="Output directory for ONNX export")
- parser.add_argument('--onnx-refit-dir', help="ONNX models to load the weights from")
- parser.add_argument('--force-onnx-export', action='store_true', help="Force ONNX export of CLIP, UNET, and VAE models")
- parser.add_argument('--force-onnx-optimize', action='store_true', help="Force ONNX optimizations for CLIP, UNET, and VAE models")
+
+ # Framework model ckpt
+ parser.add_argument('--framework-model-dir', default='pytorch_model', help="Directory for HF saved models")
# TensorRT engine build
parser.add_argument('--engine-dir', default='engine', help="Output directory for TensorRT engines")
- parser.add_argument('--force-engine-build', action='store_true', help="Force rebuilding the TensorRT engine")
+ parser.add_argument('--int8', action='store_true', help="Apply int8 quantization.")
+ parser.add_argument('--quantization-level', type=float, default=3.0, choices=[1.0, 2.0, 2.5, 3.0], help="int8/fp8 quantization level, 1: CNN, 2: CNN+FFN, 2.5: CNN+FFN+QKV, 3: CNN+FC")
parser.add_argument('--build-static-batch', action='store_true', help="Build TensorRT engines with fixed batch size.")
parser.add_argument('--build-dynamic-shape', action='store_true', help="Build TensorRT engines with dynamic image shapes.")
parser.add_argument('--build-enable-refit', action='store_true', help="Enable Refit option in TensorRT engines during build.")
- parser.add_argument('--build-preview-features', action='store_true', help="Build TensorRT engines with preview features.")
parser.add_argument('--build-all-tactics', action='store_true', help="Build TensorRT engines using all tactic sources.")
parser.add_argument('--timing-cache', default=None, type=str, help="Path to the precached timing measurements to accelerate build.")
# TensorRT inference
parser.add_argument('--num-warmup-runs', type=int, default=5, help="Number of warmup runs before benchmarking performance")
- parser.add_argument('--nvtx-profile', action='store_true', help="Enable NVTX markers for performance profiling")
- parser.add_argument('--seed', type=int, default=None, help="Seed for random generator to get consistent results")
parser.add_argument('--use-cuda-graph', action='store_true', help="Enable cuda graph")
+ parser.add_argument('--nvtx-profile', action='store_true', help="Enable NVTX markers for performance profiling")
+ parser.add_argument('--torch-inference', default='', help="Run inference with PyTorch (using specified compilation mode) instead of TensorRT.")
+ parser.add_argument('--seed', type=int, default=None, help="Seed for random generator to get consistent results")
parser.add_argument('--output-dir', default='output', help="Output directory for logs and image artifacts")
parser.add_argument('--hf-token', type=str, help="HuggingFace API access token for downloading model checkpoints")
parser.add_argument('-v', '--verbose', action='store_true', help="Show verbose output")
return parser
-
+def process_pipeline_args(args):
+ if args.height % 8 != 0 or args.width % 8 != 0:
+ raise ValueError(f"Image height and width have to be divisible by 8 but specified as: {args.image_height} and {args.width}.")
+
+ max_batch_size = 4
+ if args.batch_size > max_batch_size:
+ raise ValueError(f"Batch size {args.batch_size} is larger than allowed {max_batch_size}.")
+
+ if args.use_cuda_graph and (not args.build_static_batch or args.build_dynamic_shape):
+ raise ValueError(f"Using CUDA graph requires static dimensions. Enable `--build-static-batch` and do not specify `--build-dynamic-shape`")
+
+ if args.int8 and not args.version.startswith('xl'):
+ raise ValueError(f"int8 quantization only supported for SDXL pipeline.")
+
+ if args.lora_scale:
+ for lora_scale in (lora_scale for lora_scale in args.lora_scale if not 0 <= lora_scale <= 1):
+ raise ValueError(f"Scale of LoRA weights must be between 0 and 1, provided {lora_scale}")
+
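+ # Options collected for pipeline initialization (model version, scheduler, guidance, LoRA, output/logging settings).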
+ kwargs_init_pipeline = {
+ 'version': args.version,
+ 'max_batch_size': max_batch_size,
+ 'denoising_steps': args.denoising_steps,
+ 'scheduler': args.scheduler,
+ 'guidance_scale': args.guidance_scale,
+ 'output_dir': args.output_dir,
+ 'hf_token': args.hf_token,
+ 'verbose': args.verbose,
+ 'nvtx_profile': args.nvtx_profile,
+ 'use_cuda_graph': args.use_cuda_graph,
+ 'lora_scale': args.lora_scale,
+ 'lora_path': args.lora_path,
+ 'framework_model_dir': args.framework_model_dir,
+ 'torch_inference': args.torch_inference,
+ }
+
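+ # Options collected for ONNX export and TensorRT engine loading/building (opset, optimization shapes, tactics, refit, int8 settings).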
+ kwargs_load_engine = {
+ 'onnx_opset': args.onnx_opset,
+ 'opt_batch_size': args.batch_size,
+ 'opt_image_height': args.height,
+ 'opt_image_width': args.width,
+ 'static_batch': args.build_static_batch,
+ 'static_shape': not args.build_dynamic_shape,
+ 'enable_all_tactics': args.build_all_tactics,
+ 'enable_refit': args.build_enable_refit,
+ 'timing_cache': args.timing_cache,
+ 'int8': args.int8,
+ 'quantization_level': args.quantization_level,
+ 'denoising_steps': args.denoising_steps,
+ }
+
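+ # Remaining arguments forwarded to the demo run (prompts, image shape, batching, warmup, CUDA graph).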
+ args_run_demo = (args.prompt, args.negative_prompt, args.height, args.width, args.batch_size, args.batch_count, args.num_warmup_runs, args.use_cuda_graph)
+
+ return kwargs_init_pipeline, kwargs_load_engine, args_run_demo
diff --git a/demo/HuggingFace/.gitignore b/demo/HuggingFace/.gitignore
deleted file mode 100644
index 18a62d0f..00000000
--- a/demo/HuggingFace/.gitignore
+++ /dev/null
@@ -1,3 +0,0 @@
-*.pyc
-__pycache__/
-**/temp/
\ No newline at end of file
diff --git a/demo/HuggingFace/BART/BARTModelConfig.py b/demo/HuggingFace/BART/BARTModelConfig.py
deleted file mode 100755
index f8ea3bd7..00000000
--- a/demo/HuggingFace/BART/BARTModelConfig.py
+++ /dev/null
@@ -1,306 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import argparse
-
-from collections import namedtuple, OrderedDict
-from itertools import product
-from typing import Dict
-
-# TRT-HuggingFace
-from NNDF.networks import Precision, NetworkMetadata, NNConfig, Dims
-from NNDF.interface import MetadataArgparseInteropMixin
-
-# Limitation of namedtuples. You must declare namedtuples in module scope and not in classes.
-# Otherwise pickle doesn't work.
-# See: https://stackoverflow.com/questions/4677012/python-cant-pickle-type-x-attribute-lookup-failed
-_BARTMetadata = namedtuple("BARTMetadata", ["kv_cache"])
-
-
-class BARTMetadata(_BARTMetadata, MetadataArgparseInteropMixin):
- @staticmethod
- def add_args(parser: argparse.ArgumentParser) -> None:
- """Add commandline interface parser."""
- network_group = parser.add_argument_group("BART network")
- network_group.add_argument(
- "--variant",
- help="BART variant to generate",
- choices=BARTModelTRTConfig.TARGET_MODELS,
- required=True,
- )
- network_group.add_argument(
- "--enable-kv-cache",
- help="BART enable KV cache",
- action="store_true",
- default=False,
- )
- network_group.add_argument(
- "--num-beams", type=int, default=1, help="Enables beam search during decoding."
- )
-
- @staticmethod
- def from_args(args: argparse.Namespace):
- return NetworkMetadata(
- variant=args.variant,
- precision=Precision(fp16=False),
- other=BARTMetadata(kv_cache=args.enable_kv_cache),
- )
-
- @staticmethod
- def add_inference_args(parser: argparse.ArgumentParser) -> None:
- BARTMetadata.add_args(parser)
- inference_group = parser.add_argument_group("inference group")
- inference_group.add_argument(
- "--fp16", action="store_true", help="Enables fp16 TensorRT tactics."
- )
-
- @staticmethod
- def from_inference_args(args: argparse.Namespace):
- base_metadata = BARTMetadata.from_args(args)
- return base_metadata._replace(precision=Precision(fp16=args.fp16))
-
-
- @staticmethod
- def add_benchmarking_args(parser: argparse.ArgumentParser) -> None:
- benchmarking_group = parser.add_argument_group("benchmarking group")
- benchmarking_group.add_argument(
- "--input-seq-len",
- type=int,
- help="Specify fixed input sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
- )
- benchmarking_group.add_argument(
- "--output-seq-len",
- type=int,
- help="Specify fixed output sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
- )
-
-BARTBenchmarkingArgs = namedtuple("BARTBenchmarkingArgs", ["input_seq_len", "output_seq_len"])
-
-# trt has more benchmarking arguments
-BARTTRTBenchmarkingArgs = namedtuple("BARTTRTBenchmarkingArgs", ["input_seq_len", "output_seq_len", "input_profile_max_len", "output_profile_max_len"])
-
-class BARTModelTRTConfig(NNConfig):
-
- TARGET_MODELS = ["facebook/bart-base", "facebook/bart-large", "facebook/bart-large-cnn", "facebook/mbart-large-50"]
-
- MAX_DECODER_WORKSPACE_MB = {
- TARGET_MODELS[0]: 3072,
- TARGET_MODELS[1]: 3072,
- TARGET_MODELS[2]: 3072,
- TARGET_MODELS[3]: 3072,
- }
-
- # bart-base: 12-layer, 768-hidden, 139M parameters
- # bart-large: 24-layer, 1024-hidden, 406M parameters
- # in all bart variants, # of encoder layers and # of decoder layers are the same
- NUMBER_OF_LAYERS = {
- TARGET_MODELS[0]: 12,
- TARGET_MODELS[1]: 24,
- TARGET_MODELS[2]: 24,
- TARGET_MODELS[3]: 24,
- }
-
- NUMBER_OF_DECODER_LAYERS = {
- TARGET_MODELS[0]: 6,
- TARGET_MODELS[1]: 12,
- TARGET_MODELS[2]: 12,
- TARGET_MODELS[3]: 12,
- }
-
- # in all bart variants, # of heads in encoder and decoder are the same
- NUMBER_OF_HEADS = {
- TARGET_MODELS[0]: 12,
- TARGET_MODELS[1]: 16,
- TARGET_MODELS[2]: 16,
- TARGET_MODELS[3]: 16,
- }
-
- MAX_SEQUENCE_LENGTH = {
- TARGET_MODELS[0]: 768,
- TARGET_MODELS[1]: 1024,
- TARGET_MODELS[2]: 1024,
- TARGET_MODELS[3]: 1024,
- }
-
- # encoder hidden size is not necessarily same as max sequence length. Separate for clarification
- ENCODER_HIDDEN_SIZE = {
- TARGET_MODELS[0]: 768,
- TARGET_MODELS[1]: 1024,
- TARGET_MODELS[2]: 1024,
- TARGET_MODELS[3]: 1024,
- }
-
- # To achieve identical results with original HuggingFace implementation, the min_length in model config should be consistent with each model variant
- # see task-specific params in config.json of each variant model
- MIN_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 0,
- TARGET_MODELS[1]: 0,
- TARGET_MODELS[2]: 56,
- TARGET_MODELS[3]: 0,
- }
-
- #TODO: this might better be an inference time input like the `max_length` arg in generate() and greedy_search(). The change needed is in NNDF/interface.py:__call__ so it's a fundamental change affecting GPT2 and T5 code. Here I just put this option in BART model config for now. But it's also reasonable to treat this as a model config, because the TRT engine building may need this to have fixed dimension (e.g., to enable KV-cache)
- # see task-specific params in config.json of each variant model
- MAX_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 768,
- TARGET_MODELS[1]: 1024,
- TARGET_MODELS[2]: 142,
- TARGET_MODELS[3]: 200,
- }
-
- # BART specific configs: https://huggingface.co/facebook/bart-base/blob/main/config.json
- NO_REPEAT_NGRAM_SIZE = 3
- BOS_TOKEN_ID = 0
- EOS_TOKEN_ID = 2
-
- VOCAB_SIZE = {
- TARGET_MODELS[0]: 50265,
- TARGET_MODELS[1]: 50265,
- TARGET_MODELS[2]: 50264, # for bart-large-cnn config it's 50264 somehow. If not change here, results are incorrect since the trt results dimension reshape depends on this
- TARGET_MODELS[3]: 250054 # for mbart multilingual models, vocab size is much larger
- }
-
- NETWORK_FULL_NAME = "full"
- NETWORK_DECODER_SEGMENT_NAME = "decoder"
- NETWORK_ENCODER_SEGMENT_NAME = "encoder"
- NETWORK_SEGMENTS = [NETWORK_DECODER_SEGMENT_NAME, NETWORK_ENCODER_SEGMENT_NAME]
-
- def __init__(self):
- precision_fp16 = [False, True]
- kv_caches = [False, True]
-
- variants = []
- for variant, fp16, kv_cache in product(
- BARTModelTRTConfig.TARGET_MODELS, precision_fp16, kv_caches
- ):
- variants.append(
- NetworkMetadata(
- variant=variant,
- precision=Precision(fp16=fp16),
- other=BARTMetadata(kv_cache=kv_cache),
- )
- )
-
- super().__init__("BART", variants=variants)
-
- def get_python_requirements(self):
- base_requirements = super().get_python_requirements()
- base_requirements.append("transformers==4.8.0")
- return base_requirements
-
- def get_network_segments(self):
- """
- Returns exportable segments for the given network.
- Used in the case where a single network needs to
- be exported into multiple parts.
- """
- return BARTModelTRTConfig.NETWORK_SEGMENTS
-
- def get_metadata_string(self, metadata: NetworkMetadata) -> str:
- # Remove redundant bart name prefix
- if "mbart" in metadata.variant:
- metadata = metadata._replace(variant=metadata.variant.replace("facebook/mbart-","mbart-"))
- else:
- metadata = metadata._replace(variant=metadata.variant.replace("facebook/bart-",""))
- return super().get_metadata_string(metadata)
-
- @staticmethod
- def get_input_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of input dimensions.
- Keys will be equal to get_model_segments()
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- decoder_inputs_dict = OrderedDict(
- {
- "input_ids": (Dims.BATCH, Dims.SEQUENCE),
- "encoder_hidden_states": (
- Dims.BATCH,
- Dims.create_new_sequence_dim("encoder_hidden_length"),
- BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[metadata.variant], # dim not containing string 'Dims.BATCH' or 'Dims.SEQUENCE' will be non-dynamic axis
- ),
- }
- )
- if metadata.other.kv_cache:
- # for KV cache version, we need add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V, cross-attention K, cross-attention V)
- for i in range(BARTModelTRTConfig.NUMBER_OF_DECODER_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("past_decoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.decoder.key"] = self_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.decoder.value"] = self_attention_past_kv_dims
-
- # encoder-decoder cross-attention KV cache (dim[0] & dim[2] are dynamic, but dim[2] is constant at each decoding timestep)
- cross_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("encoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.encoder.key"] = cross_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.encoder.value"] = cross_attention_past_kv_dims
-
- decoder_inputs = Dims(decoder_inputs_dict)
-
- encoder_inputs = Dims(OrderedDict({"input_ids": (Dims.BATCH, Dims.SEQUENCE)}))
-
- return {
- BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_inputs,
- BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME: encoder_inputs,
- }
-
- @staticmethod
- def get_output_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of output dimensions.
- Keys will be equal to get_model_segments()
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- decoder_outputs_dict = OrderedDict(
- {"hidden_states": (Dims.BATCH, Dims.SEQUENCE)})
-
- if metadata.other.kv_cache:
- # for KV cache version, we need add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V, cross-attention K, cross-attention V)
-
- # for all BART variants, # encoder layers = # decoder layers, so just divide total # layers by 2
- for i in range(BARTModelTRTConfig.NUMBER_OF_DECODER_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("decoder_length"), "embedding_size_per_head")
- decoder_outputs_dict[f"present_key_values.{i}.decoder.key"] = self_attention_present_kv_dims
- decoder_outputs_dict[f"present_key_values.{i}.decoder.value"] = self_attention_present_kv_dims
-
- # encoder-decoder cross-attention KV cache (dim[0] & dim[2] are dynamic, but dim[2] is constant at each decoding timestep)
- cross_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("encoder_length"), "embedding_size_per_head")
- decoder_outputs_dict[f"present_key_values.{i}.encoder.key"] = cross_attention_present_kv_dims
- decoder_outputs_dict[f"present_key_values.{i}.encoder.value"] = cross_attention_present_kv_dims
-
- decoder_outputs = Dims(decoder_outputs_dict)
-
- encoder_outputs = Dims(
- OrderedDict(
- {
- "hidden_states": (
- Dims.BATCH,
- Dims.SEQUENCE,
- BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[metadata.variant],
- )
- }
- )
- )
-
- return {
- BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_outputs,
- BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME: encoder_outputs,
- }
diff --git a/demo/HuggingFace/BART/checkpoint.toml b/demo/HuggingFace/BART/checkpoint.toml
deleted file mode 100755
index 52add215..00000000
--- a/demo/HuggingFace/BART/checkpoint.toml
+++ /dev/null
@@ -1,26 +0,0 @@
-# Default requirements
-[BART.all.default.all.summarization]
-
-input = "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorRT, built on the NVIDIA CUDA parallel programming model, enables developers to optimize inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also uses sparse tensor cores for an additional performance boost."
-
-[BART.all."facebook/bart-base".all.summarization]
-
-label = "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorR, built on the NVIDIA CUDA parallel programming model, enables developers to accelerate inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, Tensor RT also uses sparse tensor cores for an additional performance boost."
-
-[BART.all."facebook/bart-large".all.summarization]
-
-label = "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. Tensor RT is the first GPU-based inference platform to use NVIDIA's CUDA-X architecture. TenseRT, built on the NVIDIA CUDA parallel programming model, enables developers to analyze neural network data and perform inference by leveraging libraries, development tools, and technologies in CUDA, including CUDA for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRex also uses sparse tensor cores for an additional performance boost."
-
-[BART.all."facebook/bart-large-cnn".all.summarization]
-
-label = "TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference. TensorRT is built on the NVIDIA CUDA parallel programming model. With new NVIDIA Ampere Architecture GPUs, Tensor RT also uses sparse tensor cores for an additional performance boost."
-
-[BART.all."facebook/mbart-large-50".all.summarization]
-
-label = "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorTM, built on the NVIDIA CUDA parallel programming model, enables developers of applications to optimise inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, Tensor RT also uses sparse tensor cores for an additional performance boost."
-
-# There is a weird bug in Frameworks where the output is incorrect
-# when compared to OnnxRT. Frameworks only the first two sentence is generated.
-[BART.native."facebook/bart-large-cnn".summarization]
-
-label = "TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference. TensorRT is built on the NVIDIA CUDA parallel programming model."
diff --git a/demo/HuggingFace/BART/export.py b/demo/HuggingFace/BART/export.py
deleted file mode 100755
index f3730178..00000000
--- a/demo/HuggingFace/BART/export.py
+++ /dev/null
@@ -1,419 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Contains logic that captures BART HuggingFace models into ONNX models.
-"""
-
-from itertools import islice
-from json import encoder
-import os
-from collections import OrderedDict
-
-# tensorrt
-import tensorrt as trt
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# torch
-import torch
-from torch.nn import Module
-
-# huggingface
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import Seq2SeqLMOutput
-from transformers import BartForConditionalGeneration
-
-# TRT-HuggingFace
-from BART.BARTModelConfig import BARTModelTRTConfig
-from NNDF.tensorrt_utils import OnnxProcessOperation, process_onnx
-from NNDF.networks import NetworkMetadata, Precision, Dims
-from NNDF.logger import G_LOGGER
-from NNDF.models import (
- TRTEngineFile,
- TorchModelFile,
- ONNXModelFile,
- ModelFileConverter,
-)
-
-def add_extra_fp32(network_definition):
- """
- Force operations involved in layer norm to run in FP32 precision.
- """
- pow_ops = {}
- for layer_index, layer in enumerate(network_definition[1]):
- if layer.type == trt.LayerType.IDENTITY:
- all_fp32 = all([layer.output_type_is_set(o) and layer.get_output_type(o) == trt.float32 for o in range(layer.num_outputs)])
- if all_fp32:
- if layer.get_input(0).dtype == trt.float32:
- layer.precision = trt.float32
-
- if layer.type == trt.LayerType.ELEMENTWISE:
- layer.__class__ = getattr(trt, "IElementWiseLayer")
- if layer.op == trt.ElementWiseOperation.POW:
- pow_ops[layer] = layer_index
- layer.precision = trt.float32
- layer.set_output_type(0, trt.float32)
-
- for _, index in pow_ops.items():
- # Iterate from few layers before pow to include residual add and cast op.
- # Iterate till 10 layers after pow op to include all operations included in layer norm.
- START_OFFSET = 4
- END_OFFSET = 12
- for i in range(index-START_OFFSET, index+END_OFFSET):
- l = network_definition[1].get_layer(i)
- if l.type == trt.LayerType.REDUCE:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.SUM:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.UNARY:
- l.__class__ = getattr(trt, "IUnaryLayer")
- if l.op == trt.UnaryOperation.SQRT:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.DIV:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.PROD:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- return network_definition
-
-# Torch File Encoding #
-class BARTDecoderTorchFile(TorchModelFile):
- class TorchModule(Module, GenerationMixin):
- """
- A simplied definition of BART Decoder without support for loss.
- Decoder with lm-head attached.
- """
-
- def __init__(self, decoder, lm_head, final_logits_bias, config):
- super().__init__()
- self.decoder = decoder
- self.lm_head = lm_head
- self.bias = final_logits_bias
- self.config = config
-
- @staticmethod
- def _reorder_cache(past, beam_idx):
- return BartForConditionalGeneration._reorder_cache(past, beam_idx)
-
- def prepare_inputs_for_generation(self, input_ids, past=None, use_cache=None, **kwargs):
- # cut decoder_input_ids if past is used
- if past is not None:
- input_ids = input_ids[:, -1:]
-
- ret = {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_hidden_states"],
- }
-
- # To really enable KV cache in HuggingFace, these args must be passed. Just specifying use_cache = True in BartConfig is not enough. Also see the additional "past_key_values" fields in the forward() return below.
- if self.config.use_cache:
- ret["use_cache"] = use_cache
- ret["past_key_values"] = past
-
- return ret
-
- def forward(self, input_ids, encoder_hidden_states, **kwargs):
- decoder_outputs = self.decoder(
- input_ids=input_ids,
- encoder_hidden_states=encoder_hidden_states,
- **kwargs
- )
-
- sequence_output = decoder_outputs[0]
- self.bias = self.bias.to(sequence_output.device)
- logits = self.lm_head(sequence_output) + self.bias
-
- # temporary solution: force connection between encoder_hidden_states and outputs in KV cache mode, otherwise onnx.export elimiates it and cause inconsistency between non-KV cache & KV cache and also T5 & BART
- if self.config.use_cache:
- logits = logits.view(encoder_hidden_states.size(0),logits.size(1), logits.size(2)) # (batch_size, seq_len, vocab_size)
-
- if not kwargs.get("return_dict", False):
- return (logits,) + decoder_outputs[1:]
-
- return Seq2SeqLMOutput(logits=logits, past_key_values=decoder_outputs.past_key_values if self.config.use_cache else None,)
-
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTDecoderConverter, network_metadata)
-
-
-class BARTEncoderTorchFile(TorchModelFile):
- """Creation of a class to output only the last hidden state from the encoder."""
-
- class TorchModule(Module, GenerationMixin):
- def __init__(self, encoder):
- super().__init__()
- self.encoder = encoder
-
- def forward(self, *input, **kwargs):
- return self.encoder(*input, **kwargs)[0]
-
- def __call__(self, *args, **kwargs):
- return self.forward(*args, **kwargs)
-
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTEncoderConverter, network_metadata)
-
-
-# ONNX File Encoding #
-class BARTEncoderONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTEncoderConverter, network_metadata)
-
-
-class BARTDecoderONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTDecoderConverter, network_metadata)
-
-
-# TRT Engine File Encoding #
-class BARTDecoderTRTEngine(TRTEngineFile):
-
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTDecoderConverter, network_metadata)
- self.max_trt_workspace = BARTModelTRTConfig.MAX_DECODER_WORKSPACE_MB[network_metadata.variant]
-
- def get_network_definition(self, network_definition):
- return add_extra_fp32(network_definition)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
-
-class BARTEncoderTRTEngine(TRTEngineFile):
-
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTEncoderConverter, network_metadata)
- self.max_trt_workspace = 2048
-
- def get_network_definition(self, network_definition):
- return add_extra_fp32(network_definition)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
-# Converters #
-class BARTDecoderConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(BARTDecoderTorchFile, BARTDecoderONNXFile, BARTDecoderTRTEngine)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a given huggingface BART to decoder architecture only.
-
- Args:
- output_prefix (str): Path to the onnx file
- model (torch.Model): Model loaded torch class
-
- Returns:
- BARTDecoderONNXFile: ONNX decoder object.
- """
-
- input_ids = torch.tensor([[42] * 10])
- # Exporting the decoder requires a basic instance of the encoder
- # Create one temporarily
- simplified_encoder = BARTEncoderTorchFile.TorchModule(model.get_encoder())
- # Exports to ONNX
- decoder_with_lm_head_and_bias = BARTDecoderTorchFile.TorchModule(
- model.get_decoder(), model.lm_head, model.final_logits_bias, model.config
- )
-
- inputs = BARTModelTRTConfig.get_input_dims(network_metadata)["decoder"]
- outputs = BARTModelTRTConfig.get_output_dims(network_metadata)["decoder"]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
-
- if not network_metadata.other.kv_cache:
- # This code allows for huggingface compatible torch class to use onnx exporter
- old_forward = decoder_with_lm_head_and_bias.forward
- def _export_forward(*args, **kwargs):
- result = old_forward(*args, **kwargs)
- return result[0]
- decoder_with_lm_head_and_bias.forward = _export_forward
-
- torch.onnx.export(
- decoder_with_lm_head_and_bias,
- (input_ids, simplified_encoder(input_ids)),
- output_fpath,
- export_params=True,
- opset_version=12,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
- else:
- encoder_hidden_states = simplified_encoder(input_ids)
- decoder_output = decoder_with_lm_head_and_bias(input_ids[:,:-1], encoder_hidden_states) # decoder output at t-1 step (logits, past_key_values from 0 to t-1)
- past_key_values = decoder_output[1]
-
- decoder_root, decoder_fullname = os.path.split(output_fpath)
- # Split kv and non kv onnx into separate folders to avoid weight overlap
- non_kv_root = os.path.join(decoder_root, "non-kv")
- kv_root = os.path.join(decoder_root, "kv")
- decoder_name, decoder_ext = os.path.splitext(decoder_fullname)
- non_kv_fpath = os.path.join(non_kv_root, decoder_name + "-non-kv" + decoder_ext)
- kv_fpath = os.path.join(kv_root, decoder_fullname)
-
- # This code allows for huggingface compatible torch class to use onnx exporter (change just before onnx.export)
- old_forward = decoder_with_lm_head_and_bias.forward
- def _export_forward(input_ids, encoder_hidden_states, past_key_values):
- result = old_forward(input_ids, encoder_hidden_states, past_key_values=past_key_values)
- return (result[0], result[1])
- decoder_with_lm_head_and_bias.forward = _export_forward
-
- torch.onnx.export(
- decoder_with_lm_head_and_bias,
- (input_ids[:,-1:], encoder_hidden_states,past_key_values),
- # (1) input_ids should be the t token (last one) while past_key_values is 0 to t-1 caches
- # (2) since past_key_values is kwargs, ideally use "(input_ids[:,-1:], encoder_hidden_states, {"past_key_values": past_key_values})",
- # but onnx.export seems to unable to take kwargs properly (although PyTorch 1.11 claims it supports already).
- # Therefore, we need to wrap inside _export_forward() and make past_key_values indeed a kwargs
- kv_fpath,
- export_params=True,
- opset_version=12,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- # dual-engine approach: also export non-kv onnx model. Note that this is different from the original "non-kv" model. This one traces the `use_cache` path and have present_key_values output
- def _export_forward(input_ids, encoder_hidden_states, use_cache):
- result = old_forward(input_ids, encoder_hidden_states, use_cache=use_cache)
- return (result[0], result[1])
- decoder_with_lm_head_and_bias.forward = _export_forward
-
- # inputs are same as non-kv model
- # outputs are same as kv model
- dict_inputs = inputs.get_dims()
- dict_inputs_non_kv = OrderedDict({k: dict_inputs[k] for k in ["input_ids", "encoder_hidden_states"]})
- inputs_non_kv = Dims(dict_inputs_non_kv)
-
- torch.onnx.export(
- decoder_with_lm_head_and_bias,
- (input_ids[:,-1:], encoder_hidden_states, True),
- non_kv_fpath,
- export_params=True,
- opset_version=12,
- input_names=inputs_non_kv.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs_non_kv.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- if network_metadata.precision.fp16:
- G_LOGGER.debug("Clamping FP16 weights for BART")
- # BART doesn't have T5's Add-Cast-Pow ordering issue
- if network_metadata.other.kv_cache:
- # both onnx files need clamp
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], kv_fpath, kv_fpath)
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], non_kv_fpath, non_kv_fpath)
-
- else:
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath, output_fpath)
-
- return BARTDecoderONNXFile(output_fpath, network_metadata)
-
-
-class BARTEncoderConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(BARTEncoderTorchFile, BARTEncoderONNXFile, BARTEncoderTRTEngine)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a given huggingface BART to encoder architecture only.
-
- Args:
- output_prefix (str): Path to the onnx file
- model (torch.Model): Model loaded torch class
-
- Returns:
- Tuple[str]: Names of generated models
- """
- input_ids = torch.tensor([[42] * 10])
- simplified_encoder = BARTEncoderTorchFile.TorchModule(model.get_encoder())
- inputs = BARTModelTRTConfig.get_input_dims(network_metadata)["encoder"]
- outputs = BARTModelTRTConfig.get_output_dims(network_metadata)["encoder"]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
- torch.onnx._export(
- simplified_encoder,
- input_ids,
- output_fpath,
- export_params=True,
- opset_version=12,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- if network_metadata.precision.fp16:
- G_LOGGER.debug("Clamping FP16 weights for BART")
- # BART doesn't have T5's Add-Cast-Pow ordering issue
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath, output_fpath)
-
- return BARTEncoderONNXFile(output_fpath, network_metadata)
diff --git a/demo/HuggingFace/BART/frameworks.py b/demo/HuggingFace/BART/frameworks.py
deleted file mode 100644
index 3df3e908..00000000
--- a/demo/HuggingFace/BART/frameworks.py
+++ /dev/null
@@ -1,373 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-
-from typing import List, Union
-
-# huggingface
-from transformers import (
- BartForConditionalGeneration,
- BartTokenizer,
- BartConfig,
- MBartForConditionalGeneration,
- MBart50Tokenizer,
-)
-
-# torch
-import torch
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# TRT-HuggingFace
-from NNDF.interface import FrameworkCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkResult,
- NetworkMetadata,
- NetworkRuntime,
- NetworkModels,
- NetworkModel,
- TimingProfile,
-)
-from BART.export import BARTEncoderTorchFile, BARTDecoderTorchFile
-from BART.BARTModelConfig import BARTModelTRTConfig, BARTBenchmarkingArgs
-from BART.measurements import decoder_inference, encoder_inference, full_inference_greedy, full_inference_beam, calculate_perplexity
-from NNDF.general_utils import confirm_folder_delete, NNFolderWorkspace
-
-
-class BARTHuggingFace(FrameworkCommand):
- def __init__(self):
- super().__init__(
- BARTModelTRTConfig, description="Runs framework results for BART model."
- )
-
- self.onnx_BART_encoder = None
- self.onnx_BART_decoder = None
- self.torch_BART_dir = None
-
- def generate_and_download_framework(
- self, metadata: NetworkMetadata, workspace: NNFolderWorkspace
- ) -> NetworkModels:
-
- cache_variant = False
- if metadata.other.kv_cache:
- cache_variant = True
-
- trt_BART_config = self.config
- metadata_serialized = trt_BART_config.get_metadata_string(metadata)
- workspace_dir, encoder_onnx_root, decoder_onnx_root = workspace.set_model_path(metadata_serialized, is_encoder_decoder = True)
- pytorch_model_dir = os.path.join(workspace_dir, "pytorch_model")
-
- # We keep track of the generated torch location for cleanup later
- self.torch_BART_dir = pytorch_model_dir
-
- model = None
- tfm_config = BartConfig(
- use_cache=cache_variant,
- num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant],
- ) # Note
- if not os.path.exists(pytorch_model_dir):
- # mbart variant cannot be recognized by HF yet
- if "mbart" not in metadata.variant:
- # Generate the pre-trained weights
- model = BartForConditionalGeneration(tfm_config).from_pretrained(
- metadata.variant
- )
- else:
- model = MBartForConditionalGeneration.from_pretrained(metadata.variant)
-
- model.config.use_cache = cache_variant # somehow the use_cache config automatically set to True even though specified in tfm_config before. Force change
- model.save_pretrained(pytorch_model_dir)
- print("Pytorch Model saved to {}".format(pytorch_model_dir))
- else:
- print(
- "Frameworks file already exists, skipping generation and loading from file instead."
- )
- if "mbart" not in metadata.variant:
- model = BartForConditionalGeneration(tfm_config).from_pretrained(
- pytorch_model_dir
- )
- else:
- model = MBartForConditionalGeneration.from_pretrained(pytorch_model_dir)
-
- model.config.use_cache = cache_variant # somehow the use_cache config automatically set to True even though specified in tfm_config before. Force change
-
- # These ONNX models can be converted using special encoder and decoder classes.
- encoder_onnx_model_fpath = os.path.join(encoder_onnx_root, metadata_serialized + "-encoder.onnx")
- decoder_onnx_model_fpath = os.path.join(decoder_onnx_root, metadata_serialized + "-decoder-with-lm-head.onnx")
-
- BART_encoder = BARTEncoderTorchFile(model, metadata)
- BART_decoder = BARTDecoderTorchFile(model, metadata)
- self.onnx_BART_encoder = BART_encoder.as_onnx_model(
- encoder_onnx_model_fpath, force_overwrite=False
- )
- self.onnx_BART_decoder = BART_decoder.as_onnx_model(
- decoder_onnx_model_fpath, force_overwrite=False
- )
-
- onnx_models = [
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.onnx_BART_decoder.fpath,
- ),
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=self.onnx_BART_encoder.fpath,
- ),
- ]
- torch_models = [
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_FULL_NAME, fpath=pytorch_model_dir
- )
- ]
-
- return NetworkModels(torch=torch_models, onnx=onnx_models, trt=None)
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_onnx_model: bool = True,
- keep_pytorch_model: bool = True,
- ) -> None:
- """
- Cleans up the working directory and leaves models if available.
- Should not assume any functions from the framework class has been called.
- Return:
- None
- """
- # Clean-up generated files
- if not keep_onnx_model:
- if self.onnx_BART_decoder is not None:
- self.onnx_BART_decoder.cleanup()
- if self.onnx_BART_encoder is not None:
- self.onnx_BART_encoder.cleanup()
-
- if not keep_pytorch_model:
- # Using rmtree can be dangerous, have user confirm before deleting.
- confirm_folder_delete(
- self.torch_BART_dir,
- prompt="Confirm you want to delete downloaded pytorch model folder?",
- )
-
- if not keep_pytorch_model and not keep_onnx_model:
- workspace.cleanup(force_remove=False)
-
- def setup_tokenizer_and_model(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- ):
- tokenizer = BartTokenizer.from_pretrained(metadata.variant)
-
- # By default, huggingface model structure is one giant file.
- BART_torch_fpath = network_fpaths.torch[0].fpath
- config = BartConfig(
- use_cache=metadata.other.kv_cache,
- num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant],
- )
- BART_model = BartForConditionalGeneration(config).from_pretrained(BART_torch_fpath)
- if "mbart" in metadata.variant:
- BART_model = MBartForConditionalGeneration(config).from_pretrained(BART_torch_fpath)
- tokenizer = MBart50Tokenizer.from_pretrained(metadata.variant, src_lang="en_XX")
-
- BART_torch_encoder = BARTEncoderTorchFile.TorchModule(BART_model.get_encoder())
- BART_torch_decoder = BARTDecoderTorchFile.TorchModule(
- BART_model.get_decoder(), BART_model.lm_head, BART_model.final_logits_bias, BART_model.config
- )
-
- return tokenizer, BART_torch_encoder, BART_torch_decoder
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- inference_input: str,
- timing_profile: TimingProfile,
- use_cpu: bool,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: BARTBenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer, BART_torch_encoder, BART_torch_decoder = self.setup_tokenizer_and_model(metadata, network_fpaths)
-
- # Prepare the input tokens and find output sequence length.
- if not benchmarking_mode:
- output_seq_len = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- max_seq_len = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_seq_len = benchmarking_args.input_seq_len if benchmarking_args.input_seq_len > 0 else max_seq_len
- output_seq_len = benchmarking_args.output_seq_len if benchmarking_args.output_seq_len > 0 else max_seq_len
- input_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[metadata.variant], (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- BART_torch_encoder, input_ids, timing_profile, use_cuda=(not use_cpu)
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- BART_torch_decoder, decoder_input_ids, encoder_last_hidden_state, timing_profile, use_cuda=(not use_cpu), use_cache=metadata.other.kv_cache
- )
-
- if num_beams == 1:
- decoder_output, full_e2e_runtime = full_inference_greedy(
- BART_torch_encoder,
- BART_torch_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=(not use_cpu),
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
- else:
- decoder_output, full_e2e_runtime = full_inference_beam(
- BART_torch_encoder,
- BART_torch_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams=num_beams,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
-
- # Prepare runtime results.
- runtime=[
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=network_fpaths)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=network_fpaths,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- encoder_input: str,
- decoder_input: str,
- ):
- tokenizer, BART_torch_encoder, BART_torch_decoder = self.setup_tokenizer_and_model(metadata, network_fpaths)
- encoder_input_ids = tokenizer([encoder_input], padding=True, return_tensors="pt").input_ids
- decoder_input_ids = tokenizer([decoder_input], padding=True, return_tensors="pt").input_ids
- perplexity = calculate_perplexity(
- BART_torch_encoder, BART_torch_decoder, tokenizer, encoder_input_ids, decoder_input_ids,
- BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
- )
- return perplexity
-
- def run_framework(
- self,
- metadata: NetworkMetadata,
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_pytorch_model: bool,
- timing_profile: TimingProfile,
- use_cpu: bool = False,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- """
- Main entry point of our function which compiles and generates our model data.
- """
- inference_results = []
- ppl_results = []
- workspace = NNFolderWorkspace(
- self.config.network_name, metadata, working_directory
- )
- try:
- network_fpaths = self.generate_and_download_framework(metadata, workspace)
- if not benchmarking_mode:
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, network_fpaths, ninput, timing_profile, use_cpu, batch_size, args.num_beams
- )
- )
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Encoder and decoder inputs must pair up"
- for ei, di in zip(network_input, perplexity_reference):
- ppl_results.append(
- self.execute_calculate_perplexity(
- metadata, network_fpaths, ei, di
- )
- )
- else:
- benchmarking_args = BARTBenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- inference_results = self.execute_inference(
- metadata, network_fpaths, None, timing_profile, use_cpu, batch_size, args.num_beams, True, benchmarking_args
- )
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_pytorch_model)
-
- return inference_results, ppl_results
-
-
-# Entry point
-RUN_CMD = BARTHuggingFace()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/BART/hf.py b/demo/HuggingFace/BART/hf.py
deleted file mode 100755
index ae79b64c..00000000
--- a/demo/HuggingFace/BART/hf.py
+++ /dev/null
@@ -1,68 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Obtain the benchmark timing and output from the original HuggingFace BART model.
-
-Usage: python3 hf.py --variant facebook/bart-base [--enable-kv-cache] [--fp16]
-"""
-
-import time
-from transformers import BartTokenizer, BartForConditionalGeneration
-import argparse
-
-parser = argparse.ArgumentParser()
-parser.add_argument("--variant", help="Name of BART variant.")
-parser.add_argument("--enable-kv-cache", help="Bart enable KV cache", action="store_true", default=False)
-parser.add_argument("--fp16", help="Bart FP16", action="store_true", default=False)
-parser.add_argument("--num-beams", type=int, default=1, help="Enables beam search during decoding.")
-
-args = parser.parse_args()
-
-model = BartForConditionalGeneration.from_pretrained(args.variant) # facebook/bart-base, facebook/bart-large, facebook/bart-large-cnn
-tokenizer = BartTokenizer.from_pretrained(args.variant)
-model = model.to('cuda').eval()
-
-if args.fp16:
- model = model.half()
-
-ARTICLE_TO_SUMMARIZE = (
- "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorRT, built on the NVIDIA CUDA parallel programming model, enables developers to optimize inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also uses sparse tensor cores for an additional performance boost."
-)
-
-input_ids = tokenizer([ARTICLE_TO_SUMMARIZE], padding=True, return_tensors="pt").input_ids.to('cuda')
-
-warmup = 10
-for i in range(warmup):
- summary_ids = model.generate(input_ids, max_length=1024, num_beams=args.num_beams, use_cache=args.enable_kv_cache)
-
-start = time.time()
-trials = 10
-
-input_ids = tokenizer([ARTICLE_TO_SUMMARIZE], padding=True, return_tensors="pt").input_ids.to('cuda')
-
-for i in range(trials):
- # Generate Summary. Note: generate() method already has torch.no_grad() decorator.
- summary_ids = model.generate(input_ids, max_length=1024, num_beams=args.num_beams, use_cache=args.enable_kv_cache)
-
-end = time.time()
-
-output = tokenizer.decode(summary_ids[-1,:], skip_special_tokens=True)
-
-print('BART output: ', output)
-print(f"Input sequence length: {input_ids.size(1)}, Output sequence length: {summary_ids[-1,:].size(0)}")
-print("Average run time: {:.2f} ms".format((end - start)/trials*1000))
diff --git a/demo/HuggingFace/BART/measurements.py b/demo/HuggingFace/BART/measurements.py
deleted file mode 100644
index 54f809b0..00000000
--- a/demo/HuggingFace/BART/measurements.py
+++ /dev/null
@@ -1,280 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Utils specific to BART network.
-"""
-
-# torch
-import torch
-
-# from HuggingFace transformers
-from transformers.generation_logits_process import (
- NoRepeatNGramLogitsProcessor,
- MinLengthLogitsProcessor,
- ForcedBOSTokenLogitsProcessor,
- ForcedEOSTokenLogitsProcessor,
- LogitsProcessorList,
-)
-from transformers.generation_stopping_criteria import (
- MaxLengthCriteria,
- StoppingCriteriaList,
-)
-from transformers.generation_beam_search import (
- BeamSearchScorer,
-)
-
-from BART.BARTModelConfig import BARTModelTRTConfig
-
-# TRT-HuggingFace
-from NNDF.general_utils import measure_python_inference_code
-from NNDF.torch_utils import use_cuda, expand_inputs_for_beam_search
-from NNDF.tensorrt_utils import TRTNativeRunner
-from NNDF.logger import G_LOGGER
-
-@use_cuda
-def decoder_inference(
- BART_decoder, input_ids, encoder_last_hidden_state, timing_profile, use_cuda=True, use_cache=False, past_key_values=None
-):
- # This implementation is a bit ugly. Moving implementation of the model to check HFRunner would be cleaner.
- if isinstance(BART_decoder, TRTNativeRunner):
- # Function is technically in BARTTRTDecoder however due to circular import, TRTNativeRunner in this module scope
- # implies the existence of this function.
- BART_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- BART_decoder.set_return_device("cuda" if use_cuda else "cpu")
-
- def decoder_stmt():
- BART_decoder(
- input_ids=input_ids, encoder_hidden_states=encoder_last_hidden_state, use_cache=use_cache,
- past_key_values=past_key_values
- )
-
- decoder_e2e_time = measure_python_inference_code(decoder_stmt, timing_profile)
-
- return (decoder_stmt(), decoder_e2e_time)
-
-
-@use_cuda
-def encoder_inference(BART_encoder, input_ids, timing_profile, use_cuda=True):
- encoder_stmt = lambda: BART_encoder(input_ids=input_ids)
- encoder_e2e_time = measure_python_inference_code(encoder_stmt, timing_profile)
-
- return (encoder_stmt(), encoder_e2e_time)
-
-
-# Code specifically for Pythonic inference measurement used across all BART related scripts
-@use_cuda
-def full_inference_greedy(
- BART_encoder,
- BART_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length,
- min_length=0,
- batch_size=1,
- use_cuda=True,
- early_stopping=False,
- use_cache=False
-):
- G_LOGGER.info("Running full inference with greedy decoding...")
-
- stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length)])
- no_repeat_ngram_size = BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE
- logits_processor = LogitsProcessorList([
- NoRepeatNGramLogitsProcessor(no_repeat_ngram_size),
- MinLengthLogitsProcessor(min_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token)),
- ForcedBOSTokenLogitsProcessor(tokenizer.convert_tokens_to_ids(tokenizer.bos_token)),
- ForcedEOSTokenLogitsProcessor(max_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token))
- ]) # by checking HuggingFace's generate() implementation carefully, the default logits processor for BART has no_repeat_ngram_size = 3 and forced_eos_token_id = 2. In this way we can get identical results with raw HuggingFace
-
- decoder_input_ids = torch.full(
- (batch_size, 1), tokenizer.convert_tokens_to_ids(tokenizer.eos_token), dtype=torch.int32
- )
-
- if use_cuda:
- decoder_input_ids = decoder_input_ids.to("cuda")
- else:
- decoder_input_ids = decoder_input_ids.to("cpu")
-
- def _e2e():
- with torch.no_grad():
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
- decoder_output_greedy = BART_decoder.greedy_search(
- input_ids=decoder_input_ids,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- return decoder_output_greedy
-
- # With e2e we can opt to bind inputs only once for hidden states for optimization
- def _e2e_trt():
- with torch.no_grad():
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
- BART_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- decoder_output_greedy = BART_decoder.greedy_search(
- input_ids=decoder_input_ids,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- return decoder_output_greedy
-
- measurement_function = _e2e
- if isinstance(BART_decoder, TRTNativeRunner):
- BART_decoder.set_return_device("cuda" if use_cuda else "cpu")
- measurement_function = _e2e_trt
-
- full_e2e_time = measure_python_inference_code(measurement_function, timing_profile)
-
- return (measurement_function(), full_e2e_time)
-
-@use_cuda
-def full_inference_beam(
- BART_encoder,
- BART_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams,
- max_length,
- min_length=0,
- batch_size=1,
- use_cuda=True,
- early_stopping=False, # Now used to control beam search early_stopping to have the same meaning as HuggingFace
- use_cache=False
-):
-
- G_LOGGER.info(f"Running full inference with beam search (num_beams = {num_beams}) decoding...")
-
- stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length)])
- no_repeat_ngram_size = BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE
- logits_processor = LogitsProcessorList([
- NoRepeatNGramLogitsProcessor(no_repeat_ngram_size),
- MinLengthLogitsProcessor(min_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token)),
- ForcedBOSTokenLogitsProcessor(tokenizer.convert_tokens_to_ids(tokenizer.bos_token)),
- ForcedEOSTokenLogitsProcessor(max_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token))
- ]) # HuggingFace's generate() applies a default logits processor for BART with no_repeat_ngram_size = 3 and forced_eos_token_id = 2; replicating it here makes the results identical to raw HuggingFace
-
- decoder_input_ids = torch.full(
- (batch_size, 1), tokenizer.convert_tokens_to_ids(tokenizer.eos_token), dtype=torch.int32
- )
- decoder_input_ids = expand_inputs_for_beam_search(decoder_input_ids, expand_size=num_beams)
-
- if use_cuda:
- decoder_input_ids = decoder_input_ids.to("cuda")
- else:
- decoder_input_ids = decoder_input_ids.to("cpu")
-
- def _e2e():
- with torch.no_grad():
- # beam scorer must be reset before each beam search run, otherwise beam search will be skipped due to scorer cache
- beam_scorer = BeamSearchScorer(
- batch_size=batch_size,
- num_beams=num_beams,
- device="cuda" if use_cuda else "cpu",
- do_early_stopping=early_stopping
- )
-
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
-
- encoder_last_hidden_state = expand_inputs_for_beam_search(encoder_last_hidden_state, expand_size=num_beams)
-
- decoder_output_beam = BART_decoder.beam_search(
- input_ids=decoder_input_ids,
- beam_scorer=beam_scorer,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- return decoder_output_beam
-
- # For the TRT end-to-end path we can bind the encoder hidden states only once, as an optimization
- def _e2e_trt():
- with torch.no_grad():
- # beam scorer must be reset before each beam search run, otherwise beam search will be skipped due to scorer cache
- beam_scorer = BeamSearchScorer(
- batch_size=batch_size,
- num_beams=num_beams,
- device="cuda" if use_cuda else "cpu",
- do_early_stopping=early_stopping
- )
-
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
-
- encoder_last_hidden_state = expand_inputs_for_beam_search(encoder_last_hidden_state, expand_size=num_beams)
-
- BART_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- decoder_output_beam = BART_decoder.beam_search(
- input_ids=decoder_input_ids,
- beam_scorer=beam_scorer,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- return decoder_output_beam
-
- measurement_function = _e2e
- if isinstance(BART_decoder, TRTNativeRunner):
- BART_decoder.set_return_device("cuda" if use_cuda else "cpu")
- measurement_function = _e2e_trt
-
- full_e2e_time = measure_python_inference_code(measurement_function, timing_profile)
-
- return (measurement_function(), full_e2e_time)
-
-
-@use_cuda
-def calculate_perplexity(
- BART_encoder,
- BART_decoder,
- tokenizer,
- input_ids,
- decoder_input_ids,
- max_seq_len=None,
- use_cuda=True,
-):
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
- if isinstance(BART_decoder, TRTNativeRunner):
- BART_decoder.set_return_device("cuda" if use_cuda else "cpu")
- BART_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
-
- # Shift right: prepend the EOS token (BART's decoder start token) to the decoder inputs
- decoder_input_ids_padded = torch.full(
- decoder_input_ids.size()[:-1] + (decoder_input_ids.size()[-1] + 1,),
- tokenizer.convert_tokens_to_ids(tokenizer.eos_token),
- dtype=decoder_input_ids.dtype,
- )
- decoder_input_ids_padded[..., 1:] = decoder_input_ids
-
- if use_cuda:
- encoder_last_hidden_state = encoder_last_hidden_state.to("cuda")
- decoder_input_ids_padded = decoder_input_ids_padded.to("cuda")
-
- with torch.no_grad():
- if max_seq_len is not None:
- decoder_input_ids_padded = decoder_input_ids_padded[:, :max_seq_len]
- logits = BART_decoder(decoder_input_ids_padded, encoder_last_hidden_state, return_dict=True).logits
- # Truncate the last prediction
- logits = logits[:, :-1, :]
- loss = torch.nn.CrossEntropyLoss()(logits.permute((0, 2, 1)), decoder_input_ids)
- return torch.exp(loss).item()
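
The `calculate_perplexity` helper above follows the standard teacher-forced formulation: the reference decoder ids are shifted right by one position behind the decoder start token, the decoder produces logits for every position, the prediction after the last target is dropped, and perplexity is the exponential of the token-level cross-entropy. A minimal, framework-agnostic sketch of that calculation (assuming a `decoder` callable that returns logits of shape `(batch, seq_len, vocab)`; the function and argument names here are illustrative, not part of the demo):

```python
import torch

def perplexity_from_logits(decoder, decoder_input_ids, start_token_id):
    # Shift the reference ids right by one and prepend the decoder start token.
    shifted = torch.full(
        decoder_input_ids.shape[:-1] + (decoder_input_ids.shape[-1] + 1,),
        start_token_id,
        dtype=decoder_input_ids.dtype,
    )
    shifted[..., 1:] = decoder_input_ids

    with torch.no_grad():
        logits = decoder(shifted)       # (batch, seq_len + 1, vocab), assumed
        logits = logits[:, :-1, :]      # drop the prediction after the last target
        # CrossEntropyLoss expects (batch, vocab, seq_len) logits vs. (batch, seq_len) targets.
        loss = torch.nn.CrossEntropyLoss()(
            logits.permute(0, 2, 1), decoder_input_ids.long()
        )
    return torch.exp(loss).item()
```
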
diff --git a/demo/HuggingFace/BART/onnxrt.py b/demo/HuggingFace/BART/onnxrt.py
deleted file mode 100644
index b7523e0d..00000000
--- a/demo/HuggingFace/BART/onnxrt.py
+++ /dev/null
@@ -1,353 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Executes ONNX Runtime framework code. See README.md for more information.
-"""
-
-import os
-import sys
-from typing import Dict, List, Tuple
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# huggingface
-from transformers import BartTokenizer, BartConfig, PretrainedConfig, MBart50Tokenizer
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import Seq2SeqLMOutput
-
-# torch
-import torch
-
-# TRT-HuggingFace
-from NNDF.interface import OnnxRTCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.general_utils import NNFolderWorkspace
-from NNDF.tensorrt_utils import PolygraphyOnnxRunner
-from BART.frameworks import BARTHuggingFace
-from BART.BARTModelConfig import BARTModelTRTConfig, BARTBenchmarkingArgs
-from BART.measurements import decoder_inference, encoder_inference, full_inference_greedy, full_inference_beam
-
-class OnnxHFRunner(PolygraphyOnnxRunner, GenerationMixin):
- """Runner that adds interop support for HF and HF provided greedy_search functions."""
-
- def __init__(self, engine_fpath: str, network_metadata: NetworkMetadata, tfm_config: PretrainedConfig):
- super().__init__(engine_fpath, network_metadata)
- # required for greedy search used by generation mixin
- self.config = tfm_config
-
-class BARTOnnxEncoder(OnnxHFRunner):
- """OnnxRT implemented network interface that is mainly to check correctness."""
-
- def forward(self, input_ids, *args, **kwargs):
- # Unoptimized unconditional transfer to numpy for interfacing with polygraphy
- input_ids = input_ids.cpu().numpy().astype("int64")
- return torch.from_numpy(self.trt_context.infer({"input_ids": input_ids})["hidden_states"])
-
-class BARTOnnxDecoder(OnnxHFRunner):
- def prepare_inputs_for_generation(self, input_ids, **kwargs):
- return {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_hidden_states"],
- }
-
- def forward(self, input_ids, encoder_hidden_states, *args, **kwargs):
- # Unoptimized unconditional transfer to numpy for interfacing with polygraphy
- input_ids = input_ids.cpu().numpy().astype("int64")
- encoder_hidden_states = encoder_hidden_states.cpu().numpy().astype("float32")
-
- logits = self.trt_context.infer(
- {"input_ids": input_ids, "encoder_hidden_states": encoder_hidden_states}
- )["hidden_states"]
-
- return Seq2SeqLMOutput(logits=torch.from_numpy(logits))
-
-class BARTONNXRT(OnnxRTCommand):
- def __init__(self):
- super().__init__(
- BARTModelTRTConfig,
- "Runs polygraphy results for BART model.",
- BARTHuggingFace,
- )
- self.BART_ort_decoder = None
- self.BART_ort_encoder = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.BART_ort_encoder:
- self.BART_ort_encoder.release()
- if self.BART_ort_decoder:
- self.BART_ort_decoder.release()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: BARTBenchmarkingArgs = None,
- ) -> NetworkResult:
-
- if "mbart" not in metadata.variant:
- tokenizer = BartTokenizer.from_pretrained(metadata.variant)
- else:
- tokenizer = MBart50Tokenizer.from_pretrained(metadata.variant, src_lang="en_XX")
-
- # Prepare the input tokens and find out output sequence length.
- if not benchmarking_mode:
- output_seq_len = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- max_seq_len = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_seq_len = benchmarking_args.input_seq_len if benchmarking_args.input_seq_len > 0 else max_seq_len
- output_seq_len = benchmarking_args.output_seq_len if benchmarking_args.output_seq_len > 0 else max_seq_len
- input_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[metadata.variant], (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- self.BART_ort_encoder, input_ids, timing_profile
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- self.BART_ort_decoder,
- decoder_input_ids,
- encoder_last_hidden_state,
- timing_profile,
- use_cuda=False,
- )
-
- if num_beams == 1:
- decoder_output, full_e2e_runtime = full_inference_greedy(
- self.BART_ort_encoder,
- self.BART_ort_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=False,
- use_cache=metadata.other.kv_cache,
- batch_size=batch_size,
- )
- else:
- decoder_output, full_e2e_runtime = full_inference_beam(
- self.BART_ort_encoder,
- self.BART_ort_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams=num_beams,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=False,
- use_cache=metadata.other.kv_cache,
- batch_size=batch_size,
- )
-
- # Prepare runtime results.
- runtime=[
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models=NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=None
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def run_onnxrt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- ) -> List[NetworkResult]:
- workspace = NNFolderWorkspace(
- self.frameworks_cmd.config.network_name, metadata, working_directory
- )
-
- results = []
- try:
- if metadata.other.kv_cache:
- assert False, "OnnxRT currently does not support kv cache."
- # no fpath provided for onnx files, download them
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self.frameworks_cmd.generate_and_download_framework(
- metadata, workspace
- ).onnx
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
- # The number of output networks must not exceed the number of network segments explicitly defined by the configuration file.
- assert len(onnx_fpaths) == len(
- BARTModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in BART model.".format(
- len(BARTModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- lookup_onnx_table = {v.name: v for v in onnx_fpaths}
-
- tfm_config = BartConfig(
- use_cache=metadata.other.kv_cache,
- num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant],
- )
- self.BART_ort_encoder = BARTOnnxEncoder(
- lookup_onnx_table["encoder"].fpath, metadata, tfm_config
- )
- self.BART_ort_decoder = BARTOnnxDecoder(
- lookup_onnx_table["decoder"].fpath, metadata, tfm_config
- )
-
- if not benchmarking_mode:
- for ninput in network_input:
- results.append(
- self.execute_inference(
- metadata, lookup_onnx_table, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- else:
- benchmarking_args = BARTBenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- results = self.execute_inference(
- metadata, lookup_onnx_table, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- return results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
- onnx_group = parser.add_argument_group("onnx models")
- onnx_group.add_argument(
- "--onnx-decoder-fpath",
- default=None,
- help="Path to ONNX decoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
- onnx_group.add_argument(
- "--onnx-encoder-fpath",
- default=None,
- help="Path to ONNX encoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
- # Check if both flags are given otherwise error out
- decoder_fpath_check = args.onnx_decoder_fpath is None
- encoder_fpath_check = args.onnx_encoder_fpath is None
-
- network_models = None
- if decoder_fpath_check and encoder_fpath_check:
- network_models = tuple()
- elif decoder_fpath_check or encoder_fpath_check:
- raise self._parser.error(
- "Both --onnx-decoder-fpath and --onnx-encoder-fpath must be given. Otherwise neither should be provided for script to download them."
- )
- else:
- onnx_decoder = NetworkModel(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_decoder_fpath,
- )
- onnx_encoder = NetworkModel(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=args.onnx_encoder_fpath,
- )
- network_models = (onnx_decoder, onnx_encoder)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- """Override args to metadata to use export subroutine."""
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = BARTONNXRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/BART/trt.py b/demo/HuggingFace/BART/trt.py
deleted file mode 100644
index 85bb2790..00000000
--- a/demo/HuggingFace/BART/trt.py
+++ /dev/null
@@ -1,1159 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-import copy
-from typing import Dict, List, Tuple, Union
-from functools import reduce
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# tensorrt
-import tensorrt as trt
-
-# torch
-import torch
-
-# huggingface
-from transformers import BartTokenizer, BartConfig, MBart50Tokenizer
-from transformers.modeling_outputs import Seq2SeqLMOutput
-from transformers.configuration_utils import PretrainedConfig
-from transformers.generation_utils import GenerationMixin
-
-# tensorrt
-from tensorrt import PreviewFeature
-
-# TRT-HuggingFace
-from NNDF.interface import TRTInferenceCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.tensorrt_utils import TRTNativeRunner, set_kv_data, allocate_binding_buffer, setup_benchmark_arg
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from NNDF.general_utils import NNFolderWorkspace
-from BART.frameworks import BARTHuggingFace
-from BART.BARTModelConfig import BARTModelTRTConfig, BARTMetadata, BARTTRTBenchmarkingArgs
-from BART.measurements import decoder_inference, encoder_inference, full_inference_greedy, full_inference_beam, calculate_perplexity
-from BART.export import BARTDecoderONNXFile, BARTEncoderONNXFile
-from NNDF.models import TRTEngineFile
-from NNDF.logger import G_LOGGER
-
-# from HuggingFace transformers
-from transformers.generation_logits_process import (
- NoRepeatNGramLogitsProcessor,
- MinLengthLogitsProcessor,
- ForcedBOSTokenLogitsProcessor,
- ForcedEOSTokenLogitsProcessor,
- LogitsProcessorList,
-)
-from transformers.generation_stopping_criteria import (
- MaxLengthCriteria,
- StoppingCriteriaList,
-)
-from transformers.generation_beam_search import (
- BeamSearchScorer,
-)
-
-class TRTHFRunner(TRTNativeRunner, GenerationMixin):
- """Runner that adds interop support for HF and HF provided greedy_search functions."""
-
- # Stores the encoder input length received at runtime, which is used to slice decoder inputs.
- ENCODER_LENGTH = 0
- def _allocate_memory(self,
- input_shapes: Dict[str, tuple],
- input_types: Dict[str, torch.dtype],
- output_shapes: Dict[str, tuple],
- output_types: Dict[str, torch.dtype]):
- """Helper function for binding several inputs at once and pre-allocating the results."""
- # Allocate memories as 1D linear buffers for simpler handling of dynamic shapes.
- self.inputs = allocate_binding_buffer(input_types, input_shapes)
- self.outputs = allocate_binding_buffer(output_types, output_shapes)
-
- bindings = [None] * self.trt_engine.num_bindings
-
- for input_name, input_array in self.inputs.items():
- # Allocate memory for inputs
- input_idx = self.trt_engine.get_binding_index(input_name)
- self.trt_context.set_binding_shape(input_idx, input_shapes[input_name])
- bindings[input_idx] = input_array.data_ptr()
-
- assert self.trt_context.all_binding_shapes_specified
-
- for output_name, output_array in self.outputs.items():
- # Output shape should be allocated from context size
- output_idx = self.trt_engine.get_binding_index(output_name)
- bindings[output_idx] = output_array.data_ptr()
-
- return bindings
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1
- ):
- super().__init__(trt_engine_file, network_metadata)
- self.config = hf_config
- self.batch_size = batch_size
-
-class BARTTRTEncoder(TRTHFRunner):
- """TRT implemented network interface that can be used to measure inference time."""
-
- def __init__(
- self,
- trt_engine_file: str,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- benchmarking_args: BARTTRTBenchmarkingArgs = None
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
- # In benchmarking mode, the max_sequence_length should be the designated input_profile_max_len
- if benchmarking_args is not None and benchmarking_args.input_profile_max_len is not None:
- self.max_sequence_length = benchmarking_args.input_profile_max_len
- else:
- self.max_sequence_length = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[network_metadata.variant]
- self.encoder_hidden_size = BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[network_metadata.variant]
-
- # We only have one profile to select so we can just grab the profile at the start of the class
- self.profile_idx = self.get_optimization_profile(batch_size=self.batch_size, sequence_length=1)
-
- self.input_shapes = {
- "input_ids": (self.batch_size, self.max_sequence_length)
- }
- self.input_types = {
- "input_ids": torch.int32
- }
- self.output_shapes = {
- "hidden_states": (self.batch_size, self.max_sequence_length, self.encoder_hidden_size)
- }
- self.output_types = {
- "hidden_states": torch.float32
- }
- self.bindings = self._allocate_memory(self.input_shapes, self.input_types, self.output_shapes, self.output_types)
-
- def forward(self, input_ids, *args, **kwargs):
- bs = self.batch_size
- max_length = self.max_sequence_length
- TRTHFRunner.ENCODER_LENGTH = input_ids.shape[1]
- input_length = input_ids.shape[1]
- encoder_hidden_size = self.encoder_hidden_size
-
- # Check if the input data is on CPU (which usually means PyTorch does not support the current GPU).
- is_cpu_mode = (input_ids.device == torch.device("cpu"))
-
- # We allocate the buffers using max_length, but we only need the first portion of it, so copy the data into the
- # first portion of the input buffer.
- # TODO: Could we just reuse input_ids' data_ptr() as the first binding when input_ids is already contiguous to
- # avoid an additional D2D?
- if is_cpu_mode:
- self.inputs["input_ids"] = input_ids.int().flatten().contiguous().cuda()
- self.bindings[0] = self.inputs["input_ids"].data_ptr()
- else:
- self.inputs["input_ids"][:bs * input_length] = input_ids.flatten()
-
- # Set the binding shape of input_ids, which should be (bs, input_length).
- self.trt_context.set_binding_shape(0, input_ids.shape)
-
- # Launch TRT inference.
- # TODO: Could we use execute_v2_async() instead of execute_v2()?
- self.trt_context.execute_v2(bindings=self.bindings)
-
- # We allocate the buffers using max_length, but we only need the first portion of it, so get only the first
- # portion of the output buffer and return that.
- # TODO: Could we construct a Torch tensor using given data_ptr() to avoid this D2D copy?
- hidden_states_output = self.outputs["hidden_states"]
- if is_cpu_mode:
- hidden_states_output = hidden_states_output.cpu()
-
- folded = hidden_states_output[:bs * input_length * encoder_hidden_size].view(bs, input_length, encoder_hidden_size)
-
- return folded
-
-class BARTTRTDecoder(TRTHFRunner):
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_args: BARTTRTBenchmarkingArgs = None
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
-
- # In benchmarking mode, the max_sequence_length should be the user-provided input_profile_max_len
- if benchmarking_args is not None and benchmarking_args.input_profile_max_len is not None:
- self.max_sequence_length = benchmarking_args.input_profile_max_len
- else:
- self.max_sequence_length = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[network_metadata.variant]
-
- # Similarly, the max_output_length should be the user-provided output_profile_max_len
- if benchmarking_args is not None and benchmarking_args.output_profile_max_len is not None:
- self.max_output_length = benchmarking_args.output_profile_max_len
- else:
- self.max_output_length = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[network_metadata.variant]
-
- self.encoder_hidden_size = BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[network_metadata.variant]
- self.num_heads = BARTModelTRTConfig.NUMBER_OF_HEADS[network_metadata.variant]
- self.embedding_size_per_head = self.encoder_hidden_size // self.num_heads
-
- # We only have one profile to select so we can just grab the profile at the start of the class
- self.profile_idx = self.get_optimization_profile(batch_size=self.batch_size * num_beams, sequence_length=1)
- input_profile_length = self.max_output_length if (not self.config.use_cache) else 1
- self.input_types = {
- "input_ids": torch.int32,
- "encoder_hidden_states": torch.float32
- }
- self.input_shapes = {
- "input_ids": (self.batch_size * num_beams, input_profile_length),
- "encoder_hidden_states": (self.batch_size * num_beams, self.max_sequence_length, self.encoder_hidden_size)
- }
-
- self.output_shapes = {
- "hidden_states": (self.batch_size * num_beams, self.max_output_length, BARTModelTRTConfig.VOCAB_SIZE[network_metadata.variant])
- }
- self.output_types = {
- "hidden_states": torch.float32
- }
-
- if self.config.use_cache:
-
- self.num_decoder_layers = BARTModelTRTConfig.NUMBER_OF_DECODER_LAYERS[network_metadata.variant]
- # Set kv cache shape and type
- for i in range(self.num_decoder_layers):
- kv_type_dict = {"encoder": torch.float32, "decoder": torch.float32}
- set_kv_data(self.input_types, "past", i, kv_type_dict)
- set_kv_data(self.output_types,"present", i, kv_type_dict)
-
- self_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_output_length - 1, self.embedding_size_per_head)
- cross_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_sequence_length, self.embedding_size_per_head)
- kv_shape_dict = {"encoder": cross_attention_kv_shape, "decoder": self_attention_kv_shape}
-
- set_kv_data(self.input_shapes, "past", i, kv_shape_dict)
- set_kv_data(self.output_shapes, "present", i, kv_shape_dict)
-
- self.kv_cache_binding_offset = 2 # 0: input_ids, 1: encoder_hidden_states, kv cache input indices start from 2
-
- self.bindings = self._allocate_memory(self.input_shapes, self.input_types, self.output_shapes, self.output_types)
-
- # Optimization bit
- self.persist_encoder_hidden_states = False
- self.persist_cross_attention_kv_cache = False
-
- self.use_non_kv_engine = self.config.use_cache
- # trick: set flag based on kv cache mode. This maintains code simplicity in forward() where a common codeblock is shared between non kv-cache & kv-cache modes
- # non kv-cache mode: False. Then in forward(), trt_context and bindings are set to the default ones
- # kv-cache mode: True. By default 1st decoding step starts with non-kv engine's context and binding; then flag gets updated in prepare_inputs_for_generation()
-
- self.return_device = torch.device('cuda')
-
- self.variant = network_metadata.variant # record variant name to later index the vocab_size in forward()
-
- def set_non_kv_engine_for_kv_mode(self, trt_engine_file_non_kv: TRTEngineFile):
- # same steps in tensorrt_utils.py: TRTNativeRunner
- with open(trt_engine_file_non_kv.fpath, "rb") as f:
- self.trt_engine_non_kv = self.trt_runtime.deserialize_cuda_engine(f.read())
- self.trt_context_non_kv = self.trt_engine_non_kv.create_execution_context()
-
- # Input does not have kv cache, so only input_ids and encoder_hidden_states
- self.input_types_non_kv = {k: self.input_types[k] for k in ["input_ids", "encoder_hidden_states"]}
- self.input_shapes_non_kv = {k: self.input_shapes[k] for k in ["input_ids", "encoder_hidden_states"]}
-
- # Output is the same as kv
- self.output_types_non_kv = copy.deepcopy(self.output_types)
- self.output_shapes_non_kv = copy.deepcopy(self.output_shapes)
-
- # follow same steps in _allocate_memory
- self.inputs_non_kv = allocate_binding_buffer(self.input_types_non_kv, self.input_shapes_non_kv)
- self.outputs_non_kv = allocate_binding_buffer(self.output_types_non_kv, self.output_shapes_non_kv)
-
- bindings = [None] * self.trt_engine_non_kv.num_bindings
-
- for input_name, input_array in self.inputs_non_kv.items():
- # Allocate memory for inputs
- input_idx = self.trt_engine_non_kv.get_binding_index(input_name)
- self.trt_context_non_kv.set_binding_shape(input_idx, self.input_shapes_non_kv[input_name])
- bindings[input_idx] = input_array.data_ptr()
-
- assert self.trt_context_non_kv.all_binding_shapes_specified
-
- for output_name, output_array in self.outputs_non_kv.items():
- # Output shape should be allocated from context size
- output_idx = self.trt_engine_non_kv.get_binding_index(output_name)
- bindings[output_idx] = output_array.data_ptr()
-
- self.bindings_non_kv = bindings
-
- G_LOGGER.info("Non-KV cache engine setup is successful in KV cache mode.")
-
- def set_encoder_hidden_states_for_inference_cycle(self, encoder_hidden_states):
- """Used to cache encoder hidden state runs across same encoder sessions"""
- self.persist_encoder_hidden_states = True
-
- bs = encoder_hidden_states.shape[0] # in beam search mode, bs is batch_size * num_beams
- encoder_hidden_size = self.encoder_hidden_size
- encoder_length = TRTHFRunner.ENCODER_LENGTH
- if encoder_hidden_states.device == torch.device("cpu"):
- self.inputs["encoder_hidden_states"] = encoder_hidden_states.flatten().contiguous().cuda()
- self.bindings[1] = self.inputs["encoder_hidden_states"].data_ptr()
- else:
- self.inputs["encoder_hidden_states"][:bs * encoder_length * encoder_hidden_size] = encoder_hidden_states.flatten()
-
- # for dual-engine approach in kv cache mode, set these for the non-kv engine as well
- if self.use_non_kv_engine:
- if encoder_hidden_states.device == torch.device("cpu"):
- self.inputs_non_kv["encoder_hidden_states"] = encoder_hidden_states.flatten().contiguous().cuda()
- self.bindings_non_kv[1] = self.inputs_non_kv["encoder_hidden_states"].data_ptr()
- else:
- self.inputs_non_kv["encoder_hidden_states"][:bs * encoder_length * encoder_hidden_size] = encoder_hidden_states.flatten()
-
- def set_cross_attention_kv_cache_for_inference_cycle(self, past_key_values):
- """
- Used to cache encoder-decoder cross attention kv caches across same encoder sessions.
-
- Unlike the self-attention cache, cross attention is constant during the decoding process, so we only need to set its bindings once at the first decoding step and skip it in all later steps (via the self.persist_cross_attention_kv_cache flag)
- """
- self.persist_cross_attention_kv_cache = True
-
- bs = past_key_values[0][0].shape[0] # In beam search, it should be batch_size * num_beams
- encoder_length = TRTHFRunner.ENCODER_LENGTH if past_key_values is not None else 0
- num_heads = self.num_heads
- embedding_size_per_head = self.embedding_size_per_head
-
- for i in range(self.num_decoder_layers):
-
- # Set the binding shape of cross-attention KV caches, which should be (bs, num_heads, encoder_length, embedding_size_per_head).
- cross_attention_kv_shape = (bs, num_heads, encoder_length, embedding_size_per_head)
- cross_attention_kv_flatten_length = bs * num_heads * encoder_length * embedding_size_per_head
-
- if past_key_values is not None:
- if past_key_values[0][0].device == torch.device("cpu"):
- self.inputs[f"past_key_values.{i}.encoder.key"] = past_key_values[i][2].flatten().contiguous().cuda()
- self.bindings[self.kv_cache_binding_offset+4*i+2] = self.inputs[f"past_key_values.{i}.encoder.key"].data_ptr()
-
- self.inputs[f"past_key_values.{i}.encoder.value"] = past_key_values[i][3].flatten().contiguous().cuda()
- self.bindings[self.kv_cache_binding_offset+4*i+3] = self.inputs[f"past_key_values.{i}.encoder.value"].data_ptr()
- else:
- self.inputs[f"past_key_values.{i}.encoder.key"][:cross_attention_kv_flatten_length] = past_key_values[i][2].flatten()
-
- self.inputs[f"past_key_values.{i}.encoder.value"][:cross_attention_kv_flatten_length] = past_key_values[i][3].flatten()
-
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 2, cross_attention_kv_shape)
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 3, cross_attention_kv_shape)
-
- def set_return_device(self, return_device):
- """
- Sets the device that returned tensors are moved to via to(). The device name should match torch device strings: cuda, cpu, etc.
- This is used in our measurement code.
- """
- self.return_device = return_device
-
- def _reorder_cache(self, past, beam_idx):
- reordered_past = ()
- for layer_past in past:
- # cached cross_attention states don't have to be reordered -> they are always the same
- reordered_past += (
- tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
- )
- return reordered_past
-
- def forward(self, input_ids, encoder_hidden_states, *args, **kwargs):
- # Get the batch size.
- bs = input_ids.shape[0] # in beam search mode, bs is batch_size * num_beams
-
- # Get the maximum sequence length.
- max_length = self.max_sequence_length
-
- # Get the vocab size.
- vocab_size = BARTModelTRTConfig.VOCAB_SIZE[self.variant]
-
- # Actual sequence length of the input_ids and the output hidden_states.
- input_length = input_ids.shape[1]
-
- # The sequence length of the encoder_hidden_states.
- encoder_length = TRTHFRunner.ENCODER_LENGTH
-
- # Encoder hidden size
- encoder_hidden_size = self.encoder_hidden_size
-
- # KV cache flag
- use_cache = kwargs.get("use_cache", False)
-
- # flag for switch between dual engines
- non_kv_flag = self.use_non_kv_engine or (self.config.use_cache and kwargs.get("past_key_values") is None)
- # condition 1: during e2e decoding test, based on flag
- # condition 2: during single-step decoder test, depending on whether past_key_values is empty
- # note: without --enable-kv-cache arg, this flag should remain False
-
- # denote as variable to allow switch between non-kv and kv engines in kv cache mode
- trt_context = self.trt_context_non_kv if non_kv_flag else self.trt_context
- bindings = self.bindings_non_kv if non_kv_flag else self.bindings
- inputs = self.inputs_non_kv if non_kv_flag else self.inputs
- outputs = self.outputs_non_kv if non_kv_flag else self.outputs
-
- # Check if the input data is on CPU (which usually means PyTorch does not support the current GPU).
- is_cpu_mode = (input_ids.device == torch.device("cpu")) or (self.return_device == "cpu")
-
- # We allocate the buffers using max_length, but we only need the first portion of it, so copy the data into the
- # first portion of the input buffer.
- # TODO: Could we just reuse input_ids' data_ptr() as the first binding when input_ids is already contiguous to
- # avoid an additional D2D?
- if is_cpu_mode:
- inputs["input_ids"] = input_ids.int().flatten().contiguous().cuda()
- bindings[0] = inputs["input_ids"].data_ptr()
- else:
- inputs["input_ids"][:bs * input_length] = input_ids.flatten()
-
- # Set the binding shape of input_ids, which should be (bs, input_length).
- trt_context.set_binding_shape(0, input_ids.shape)
-
- # If encoder hidden states have not been copied yet, copy the hidden states to the input buffer.
- if not self.persist_encoder_hidden_states:
- if is_cpu_mode:
- inputs["encoder_hidden_states"] = encoder_hidden_states.flatten().contiguous().cuda()
- bindings[1] = inputs["encoder_hidden_states"].data_ptr()
- else:
- inputs["encoder_hidden_states"][:bs * encoder_length * encoder_hidden_size] = encoder_hidden_states.flatten()
-
- # Set the binding shape of encoder_hidden_states, which should be (bs, encoder_length, encoder_hidden_size).
- trt_context.set_binding_shape(1, (bs, encoder_length, encoder_hidden_size))
-
- if self.config.use_cache: # or use_cache
- if non_kv_flag:
- # use non-kv engine, no additional inputs
- past_decoder_length = 0
- else:
- # use kv engine
- past_key_values = kwargs.get("past_key_values") # set by prepare_inputs_for_generation() during the HF e2e pipeline; when testing the decoder alone, this field must be set explicitly
- past_decoder_length = past_key_values[0][0].size(2)
- num_heads = self.num_heads
- embedding_size_per_head = self.embedding_size_per_head
-
- # for all BART variants, # encoder layers = # decoder layers, so just divide total # layers by 2
- for i in range(self.num_decoder_layers):
-
- # Set the binding shape of self-attention KV caches, which should be (bs, num_heads, past_decoder_length, embedding_size_per_head).
- self_attention_kv_shape = (bs, num_heads, past_decoder_length, embedding_size_per_head)
- self_attention_kv_flatten_length = bs * num_heads * past_decoder_length * embedding_size_per_head
-
- if past_key_values is not None:
- if past_key_values[0][0].device == torch.device("cpu"):
- inputs[f"past_key_values.{i}.decoder.key"] = past_key_values[i][0].flatten().contiguous().cuda()
- bindings[self.kv_cache_binding_offset+4*i] = inputs[f"past_key_values.{i}.decoder.key"].data_ptr()
-
- inputs[f"past_key_values.{i}.decoder.value"] = past_key_values[i][1].flatten().contiguous().cuda()
- bindings[self.kv_cache_binding_offset+4*i+1] = inputs[f"past_key_values.{i}.decoder.value"].data_ptr()
-
- else:
- inputs[f"past_key_values.{i}.decoder.key"][:self_attention_kv_flatten_length] = past_key_values[i][0].flatten()
-
- inputs[f"past_key_values.{i}.decoder.value"][:self_attention_kv_flatten_length] = past_key_values[i][1].flatten()
-
- trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i, self_attention_kv_shape)
- trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 1, self_attention_kv_shape)
-
- # Set the binding shape of cross-attention KV caches, which should be (bs, num_heads, encoder_length, embedding_size_per_head).
- # since cross-attention KV cache dimension is fixed, we set once at the start and skip later
- if not self.persist_cross_attention_kv_cache:
- self.set_cross_attention_kv_cache_for_inference_cycle(past_key_values)
-
- # Launch TRT inference.
- # TODO: Could we use execute_v2_async() instead of execute_v2()? Current profiling shows that there is a
- # synchronization inside TRT's inference body, so this change may not be needed.
- trt_context.execute_v2(bindings=bindings)
-
- # We allocate the buffers using max_length, but we only need the first portion of it, so get only the first
- # portion of the output buffer and return that.
- # TODO: Could we construct a Torch tensor using given data_ptr() to avoid this D2D copy?
- hidden_states_output = outputs["hidden_states"]
- if is_cpu_mode:
- hidden_states_output = hidden_states_output.cpu()
-
- folded = hidden_states_output[:bs * input_length * vocab_size].view(bs, input_length, vocab_size)
- present_key_values = None
- if self.config.use_cache:
- # 1st decoding step and steps after handle the outputs in the same way
- present_key_values = ()
- curr_decoder_length = past_decoder_length + input_length
- num_heads = self.num_heads
- embedding_size_per_head = self.embedding_size_per_head
-
- for i in range(self.num_decoder_layers):
-
- self_attention_kv_shape = (bs, num_heads, curr_decoder_length, embedding_size_per_head)
- self_attention_kv_flatten_length = bs * num_heads * curr_decoder_length * embedding_size_per_head
-
- cross_attention_kv_shape = (bs, num_heads, encoder_length, embedding_size_per_head)
- cross_attention_kv_flatten_length = bs * num_heads * encoder_length * embedding_size_per_head
-
- self_attn_k_output = outputs[f"present_key_values.{i}.decoder.key"]
- self_attn_v_output = outputs[f"present_key_values.{i}.decoder.value"]
- if is_cpu_mode:
- self_attn_k_output = self_attn_k_output.cpu()
- self_attn_v_output = self_attn_v_output.cpu()
-
- self_attn_k = self_attn_k_output[:self_attention_kv_flatten_length].view(*self_attention_kv_shape)
- self_attn_v = self_attn_v_output[:self_attention_kv_flatten_length].view(*self_attention_kv_shape)
-
- cross_attn_k = None
- cross_attn_v = None
- if is_cpu_mode or non_kv_flag:
- cross_attn_k_output = outputs[f"present_key_values.{i}.encoder.key"]
- cross_attn_v_output = outputs[f"present_key_values.{i}.encoder.value"]
- if is_cpu_mode:
- cross_attn_k_output = cross_attn_k_output.cpu()
- cross_attn_v_output = cross_attn_v_output.cpu()
- cross_attn_k = cross_attn_k_output[:cross_attention_kv_flatten_length].view(*cross_attention_kv_shape)
- cross_attn_v = cross_attn_v_output[:cross_attention_kv_flatten_length].view(*cross_attention_kv_shape)
-
- present_key_values += ((self_attn_k, self_attn_v, cross_attn_k, cross_attn_v), ) # make multi-dim tuple
-
- # Transfer predictions back from GPU to do greedy search
- return Seq2SeqLMOutput(logits=folded.to(self.return_device), past_key_values=present_key_values,)
-
- def prepare_inputs_for_generation(self, input_ids, past=None, use_cache=None, **kwargs):
- # in HuggingFace generation_utils.py, this function will be called at each decoding step, before running the decoder's forward().
- # So we can use it to set the flag indicating if this is the 1st decoding step (use non-kv engine) or steps after (use kv engine)
- # cut decoder_input_ids if past is used (with past cache, only need to process the current length 1 token)
- # also, if past exists, it means we're at > 1 decoding steps thus set non-kv engine flag to False
- if past is not None:
- input_ids = input_ids[:, -1:]
- self.use_non_kv_engine = False
-
- ret = {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_hidden_states"],
- }
-
- if self.config.use_cache:
- ret["use_cache"] = use_cache
- ret["past_key_values"] = past
-
- return ret
-
-
-class BARTTRT(TRTInferenceCommand):
- def __init__(self):
- super().__init__(
- BARTModelTRTConfig,
- "Runs trt results for BART model.",
- BARTHuggingFace,
- )
- self.BART_trt_decoder = None
- self.BART_trt_encoder = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_trt_engine: bool = False,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.BART_trt_encoder:
- self.BART_trt_encoder.release()
- if self.BART_trt_decoder:
- self.BART_trt_decoder.release()
-
- if not keep_trt_engine:
- self.BART_trt_encoder_engine.cleanup()
- self.BART_trt_decoder_engine.cleanup()
- # TODO: Avoid using workspace.metadata to handle non_kv removals.
- if workspace.metadata.other.kv_cache:
- self.BART_trt_decoder_engine_non_kv.cleanup()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def setup(self, encoder, decoder):
- self.BART_trt_encoder = encoder
- self.BART_trt_decoder = decoder
-
- def generate(
- self,
- input_ids,
- min_length: int = None,
- max_length: int = None,
- num_beams: int = 1,
- use_cache: bool = False,
- early_stopping: bool = True, # Deprecated
- ):
- batch_size = input_ids.shape[0]
-
- if max_length is None:
- max_length = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[self.metadata.variant]
-
- if min_length is None:
- min_length = BARTModelTRTConfig.MIN_OUTPUT_LENGTH[self.metadata.variant]
-
- stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length)])
- logits_processor = LogitsProcessorList([
- NoRepeatNGramLogitsProcessor(BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE),
- MinLengthLogitsProcessor(min_length, BARTModelTRTConfig.EOS_TOKEN_ID),
- ForcedBOSTokenLogitsProcessor(BARTModelTRTConfig.BOS_TOKEN_ID),
- ForcedEOSTokenLogitsProcessor(max_length, BARTModelTRTConfig.EOS_TOKEN_ID)
- ])
-
- decoder_input_ids = torch.full(
- (batch_size, 1), BARTModelTRTConfig.EOS_TOKEN_ID, dtype=torch.int32
- ).to("cuda")
-
- if num_beams == 1:
- G_LOGGER.info("Running full inference with greedy decoding...")
- encoder_last_hidden_state = self.BART_trt_encoder(input_ids=input_ids)
- self.BART_trt_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- decoder_output = self.BART_trt_decoder.greedy_search(
- input_ids=decoder_input_ids,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- else:
- G_LOGGER.info(f"Running full inference with beam search (num_beams = {num_beams}) decoding...")
-
- beam_scorer = BeamSearchScorer(
- batch_size=batch_size,
- num_beams=num_beams,
- device="cuda",
- do_early_stopping=early_stopping,
- )
-
- decoder_input_ids = expand_inputs_for_beam_search(decoder_input_ids, expand_size=num_beams)
-
- encoder_last_hidden_state = self.BART_trt_encoder(input_ids=input_ids)
-
- encoder_last_hidden_state = expand_inputs_for_beam_search(encoder_last_hidden_state, expand_size=num_beams)
-
- self.BART_trt_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- decoder_output = self.BART_trt_decoder.beam_search(
- input_ids=decoder_input_ids,
- beam_scorer=beam_scorer,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
-
- self.reset_decoder_state()
-
- return decoder_output
-
- def reset_decoder_state(self):
- # During execute_inference, set_encoder_hidden_states_for_inference_cycle will be called in full_inference_greedy anyway to overwrite the saved encoder_hidden_states
- # But explicitly resetting this flag is still beneficial
- self.BART_trt_decoder.persist_encoder_hidden_states = False
- # Because the same decoder is reused across inputs, its flags need to be reset between inputs.
- # TODO: In BARTTRTDecoder, a dedicated reset function may be needed to handle this after each task.
- if self.metadata.other.kv_cache:
- self.BART_trt_decoder.persist_cross_attention_kv_cache = False
- self.BART_trt_decoder.use_non_kv_engine = self.metadata.other.kv_cache
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: BARTTRTBenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
- if "mbart" not in metadata.variant:
- tokenizer = BartTokenizer.from_pretrained(metadata.variant)
- else:
- tokenizer = MBart50Tokenizer.from_pretrained(metadata.variant, src_lang="en_XX")
-
- # Prepare the input tokens and find output sequence length.
- if not benchmarking_mode:
- output_seq_len = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- input_seq_len = benchmarking_args.input_seq_len
- output_seq_len = benchmarking_args.output_seq_len
-
- input_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[metadata.variant], (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- self.BART_trt_encoder, input_ids, timing_profile
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- self.BART_trt_decoder,
- expand_inputs_for_beam_search(decoder_input_ids, num_beams) if num_beams > 1 else decoder_input_ids,
- expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state,
- timing_profile,
- use_cache=metadata.other.kv_cache,
- )
-
- if num_beams == 1:
- decoder_output, full_e2e_runtime = full_inference_greedy(
- self.BART_trt_encoder,
- self.BART_trt_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
- else:
- decoder_output, full_e2e_runtime = full_inference_beam(
- self.BART_trt_encoder,
- self.BART_trt_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams=num_beams,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
-
- # Prepare runtime results.
- runtime=[
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models=NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=[
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.BART_trt_decoder_engine.fpath,
- ),
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=self.BART_trt_encoder_engine.fpath,
- ),
- ],
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- encoder_input: str,
- decoder_input: str,
- batch_size: int,
- ):
- if "mbart" not in metadata.variant:
- tokenizer = BartTokenizer.from_pretrained(metadata.variant)
- else:
- tokenizer = MBart50Tokenizer.from_pretrained(metadata.variant, src_lang="en_XX")
-
- encoder_input_ids = tokenizer([encoder_input] * batch_size, padding=True, return_tensors="pt").input_ids
- decoder_input_ids = tokenizer([decoder_input] * batch_size, padding=True, return_tensors="pt").input_ids
-
- perplexity = calculate_perplexity(
- self.BART_trt_encoder, self.BART_trt_decoder, tokenizer, encoder_input_ids, decoder_input_ids,
- BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
- )
- return perplexity
-
- def _setup_engines(
- self,
- metadata: NetworkMetadata,
- hash_onnx_fpath: Dict[str, NetworkModel],
- batch_size: int,
- num_beams: int,
- disable_preview_dynamic_shapes: bool,
- benchmarking_args: BARTTRTBenchmarkingArgs = None,
- seq_tag: bool = False, # whether the benchmark engine tag format should be seq or max
- ) -> None:
-
- # The number of output networks must not exceed the number of network segments explicitly defined by the configuration file.
- assert len(hash_onnx_fpath) == len(
- BARTModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in BART model.".format(
- len(BARTModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- decoder_onnx_fpath = hash_onnx_fpath[
- BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ].fpath
- encoder_onnx_fpath = hash_onnx_fpath[
- BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME
- ].fpath
-
- # Generate optimization profiles.
- # non-benchmarking mode: opt profile length is by default half of the max profile
- # benchmarking mode: user can specify opt and max profile by flags. If no additional benchmarking flags are provided, it will just use the non-benchmarking mode defaults
- max_sequence_length = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- max_output_length = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- opt_input_seq_len = max_sequence_length // 2
- opt_output_seq_len = max_output_length // 2
-
- # benchmarking flags
- if benchmarking_args is not None:
- max_sequence_length = benchmarking_args.input_profile_max_len
- max_output_length = benchmarking_args.output_profile_max_len
- opt_input_seq_len = benchmarking_args.input_seq_len
- opt_output_seq_len = benchmarking_args.output_seq_len
-
- encoder_hidden_size = BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[metadata.variant]
-
- encoder_profiles = [
- Profile().add(
- "input_ids",
- min=(batch_size, 1),
- opt=(batch_size, opt_input_seq_len),
- max=(batch_size, max_sequence_length),
- )
- ]
-
- # Set up the non kv engine, used for non-kv mode and kv mode generation phase (1st decoder run uses the non-kv profile to generate kv cache)
- dec_profiles_non_kv = Profile()
-
- # for beam search, decoder engine's inputs are expanded `num_beams` times
- # optimization profiles should be changed accordingly, but onnx models can be shared across greedy/beam because the first dim (batch size) is already a dynamic value, so no change needed in export.py
- if not metadata.other.kv_cache:
- dec_profiles_non_kv = dec_profiles_non_kv.add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, opt_output_seq_len),
- max=(batch_size * num_beams, max_output_length),
- )
- else:
- dec_profiles_non_kv = dec_profiles_non_kv.add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, 1),
- max=(batch_size * num_beams, 1),
- )
-
- dec_profiles_non_kv = dec_profiles_non_kv.add(
- "encoder_hidden_states",
- min=(batch_size * num_beams, 1, encoder_hidden_size),
- opt=(batch_size * num_beams, opt_input_seq_len, encoder_hidden_size),
- max=(batch_size * num_beams, max_sequence_length, encoder_hidden_size),
- )
-
- decoder_profiles_non_kv = [dec_profiles_non_kv]
- dec_profiles_kv = copy.deepcopy(dec_profiles_non_kv)
- if metadata.other.kv_cache:
-
- num_heads = BARTModelTRTConfig.NUMBER_OF_HEADS[metadata.variant]
- embedding_size_per_head = encoder_hidden_size // num_heads
- num_decoder_layers = BARTModelTRTConfig.NUMBER_OF_DECODER_LAYERS[metadata.variant]
-
- self_attention_profile = {
- "min": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_output_seq_len - 1, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_output_length - 1, embedding_size_per_head),
- }
- cross_attention_profile = {
- "min": (batch_size * num_beams, num_heads, 1, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_input_seq_len, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_sequence_length, embedding_size_per_head),
- }
-
- for i in range(num_decoder_layers):
- dec_profiles_kv = dec_profiles_kv.add(
- f"past_key_values.{i}.decoder.key",
- **self_attention_profile
- )
- dec_profiles_kv = dec_profiles_kv.add(
- f"past_key_values.{i}.decoder.value",
- **self_attention_profile
- )
- dec_profiles_kv = dec_profiles_kv.add(
- f"past_key_values.{i}.encoder.key",
- **cross_attention_profile
- )
- dec_profiles_kv = dec_profiles_kv.add(
- f"past_key_values.{i}.encoder.value",
- **cross_attention_profile
- )
- decoder_profiles_kv = [dec_profiles_kv]
-
- decoder_profiles = decoder_profiles_kv if (metadata.other.kv_cache) else decoder_profiles_non_kv
-
- # Convert ONNX models to TRT engines.
- if benchmarking_args is None:
- engine_tag = "bs{}".format(batch_size)
- # When the user does not provide any profile_max_len, use the sequence lengths as the tag; both max values fall back to the config max
- elif seq_tag:
- engine_tag = "bs{}-inseq{}-outseq{}".format(batch_size, benchmarking_args.input_seq_len, benchmarking_args.output_seq_len)
- # When the user provides profile_max_len, the engine can be reused later with different seq_len values
- else:
- engine_tag = "bs{}-inmax{}-outmax{}".format(batch_size, benchmarking_args.input_profile_max_len, benchmarking_args.output_profile_max_len)
-
- if num_beams > 1:
- engine_tag += "-beam{}".format(num_beams)
-
- preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
- if disable_preview_dynamic_shapes:
- engine_tag += "-noPreviewFasterDynamicShapes"
- else:
- preview_features.append(PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)
-
- self.BART_trt_encoder_engine = BARTEncoderONNXFile(
- encoder_onnx_fpath, metadata
- ).as_trt_engine(
- encoder_onnx_fpath + "-{}.engine".format(engine_tag).replace(f"-beam{num_beams}", ""), # encoder engine name not affected by beam search
- profiles=encoder_profiles,
- preview_features=preview_features
- )
-
- if not metadata.other.kv_cache:
- self.BART_trt_decoder_engine = BARTDecoderONNXFile(
- decoder_onnx_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles,
- preview_features=preview_features
- )
- else:
- decoder_root, decoder_fullname = os.path.split(decoder_onnx_fpath)
- # Split kv and non kv engines into separate folders to avoid weight overlap
- non_kv_root = os.path.join(decoder_root, "non-kv")
- kv_root = os.path.join(decoder_root, "kv")
- decoder_name, decoder_ext = os.path.splitext(decoder_fullname)
- decoder_onnx_non_kv_fpath = os.path.join(non_kv_root, decoder_name + "-non-kv" + decoder_ext)
- decoder_onnx_kv_fpath = os.path.join(kv_root, decoder_fullname)
- self.BART_trt_decoder_engine = BARTDecoderONNXFile(
- decoder_onnx_kv_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_kv_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles,
- preview_features=preview_features
- )
-            # Dual-engine approach: we still need to set up the non-kv engine in kv mode.
-            # Note: workspace cleanup is not handled for these extra non-kv files.
- self.BART_trt_decoder_engine_non_kv = BARTDecoderONNXFile(
- decoder_onnx_non_kv_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_non_kv_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles_non_kv,
- preview_features=preview_features
- )
-
- # Create BARTTRTEncoder and BARTTRTDecoder instances.
- tfm_config = BartConfig(
- use_cache=metadata.other.kv_cache,
- num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant],
- )
- self.BART_trt_encoder = BARTTRTEncoder(
- self.BART_trt_encoder_engine, metadata, tfm_config, batch_size=batch_size, benchmarking_args = benchmarking_args
- )
- self.BART_trt_decoder = BARTTRTDecoder(
- self.BART_trt_decoder_engine, metadata, tfm_config, batch_size=batch_size, num_beams=num_beams, benchmarking_args = benchmarking_args
- )
-
- if metadata.other.kv_cache:
-            # Switching between BARTTRTDecoder instances is impossible (because the HF decoding step is bound to one decoder). Therefore, we add the non-kv engine inside the same decoder --> the decoder contains two TRT engines.
- self.BART_trt_decoder.set_non_kv_engine_for_kv_mode(self.BART_trt_decoder_engine_non_kv)
-
- def run_trt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_trt_engine: bool,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- disable_preview_dynamic_shapes: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult] :
-
- self.working_directory = working_directory
- workspace = self._setup_workspace(metadata, working_directory)
-
-        # Keep the ONNX and Torch models if they are provided by the user.
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self._download_models(workspace, metadata)
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
- hash_onnx_fpath = {v.name: v for v in onnx_fpaths}
-
- inference_results = []
- ppl_results = []
- try:
- if not benchmarking_mode:
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes)
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, hash_onnx_fpath, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- self.reset_decoder_state()
-
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Encoder and decoder inputs must pair up"
-
- if metadata.other.kv_cache or (args.num_beams > 1):
- G_LOGGER.warning("Skipping perplexity calculation for TRT with KV cache or beam search because it is not supported yet.")
- else:
- for ei, di in zip(network_input, perplexity_reference):
- ppl_results.append(
- self.execute_calculate_perplexity(metadata, ei, di, batch_size)
- )
-
- else:
-                # Check that input_seq_len and output_seq_len are valid and within the required range
- max_input_seq_len = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- max_output_seq_len = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
-
- seq_tag = args.input_profile_max_len is None and args.output_profile_max_len is None
-                # The user must provide either a pair of [input/output]_profile_max_len values or a pair of [input/output]_seq_len values
- if args.input_profile_max_len is None or args.output_profile_max_len is None:
- if args.input_seq_len is None or args.output_seq_len is None:
- assert False, "Please provide at least one pair of inputs: [input/output]_seq_len or [input/output]_profile_max_len"
-
- input_profile_max_len = setup_benchmark_arg(args.input_profile_max_len, "input_profile_max_len", max_input_seq_len)
- output_profile_max_len = setup_benchmark_arg(args.output_profile_max_len, "output_profile_max_len", max_output_seq_len)
- input_seq_len = setup_benchmark_arg(args.input_seq_len, "input_seq_len", input_profile_max_len // 2)
- output_seq_len = setup_benchmark_arg(args.output_seq_len, "output_seq_len", output_profile_max_len // 2)
-
- benchmarking_args = BARTTRTBenchmarkingArgs(input_seq_len, output_seq_len, input_profile_max_len, output_profile_max_len)
-
- # Assert to ensure the validity of benchmarking arguments
-                assert benchmarking_args.input_seq_len <= benchmarking_args.input_profile_max_len, "input_seq_len should be <= input_profile_max_len = {} in benchmarking mode".format(benchmarking_args.input_profile_max_len)
-                assert benchmarking_args.output_seq_len <= benchmarking_args.output_profile_max_len, "output_seq_len should be <= output_profile_max_len = {} in benchmarking mode".format(benchmarking_args.output_profile_max_len)
-                assert benchmarking_args.input_profile_max_len <= max_input_seq_len, "Model config restricts input_profile_max_len to <= {} in benchmarking mode".format(max_input_seq_len)
-                assert benchmarking_args.output_profile_max_len <= max_output_seq_len, "Model config restricts output_profile_max_len to <= {} in benchmarking mode".format(max_output_seq_len)
-
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes, benchmarking_args, seq_tag)
- inference_results = self.execute_inference(
- metadata, hash_onnx_fpath, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_trt_engine, keep_onnx_model, keep_torch_model)
-
- return inference_results, ppl_results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
- polygraphy_group = parser.add_argument_group("polygraphy models")
- polygraphy_group.add_argument(
- "--onnx-decoder-fpath",
- default=None,
-            help="Path to the ONNX decoder. If None is supplied, the script will generate it from HuggingFace.",
- )
- polygraphy_group.add_argument(
- "--onnx-encoder-fpath",
- default=None,
-            help="Path to the ONNX encoder. If None is supplied, the script will generate it from HuggingFace.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
-        # Check that both flags are given; otherwise error out
- decoder_fpath_check = args.onnx_decoder_fpath is None
- encoder_fpath_check = args.onnx_encoder_fpath is None
-
- network_models = None
- if decoder_fpath_check and encoder_fpath_check:
- network_models = tuple()
- elif decoder_fpath_check or encoder_fpath_check:
- raise self._parser.error(
-                "Both --onnx-decoder-fpath and --onnx-encoder-fpath must be given; otherwise, provide neither so that the script can download them."
- )
- else:
- onnx_decoder = NetworkModel(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_decoder_fpath,
- )
- onnx_encoder = NetworkModel(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=args.onnx_encoder_fpath,
- )
- network_models = (onnx_decoder, onnx_encoder)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = BARTTRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
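
The decoder profiles built above follow a single pattern: chain `Profile.add()` calls and reuse one shape dictionary per attention type, so every layer's `past_key_values` tensor gets identical min/opt/max bounds. A minimal, self-contained sketch of that pattern with Polygraphy is shown below; the batch size, beam width, head count, head size, and layer count are illustrative placeholders rather than values taken from the demo.

```python
# Sketch only: mirrors the profile-building pattern used by the removed BART runner.
# All sizes below are illustrative, not the demo's real configuration.
from polygraphy.backend.trt import Profile

batch_size, num_beams, num_heads, head_size, num_layers = 1, 2, 16, 64, 2
bs = batch_size * num_beams

profile = Profile()
profile.add("input_ids", min=(bs, 1), opt=(bs, 64), max=(bs, 128))

# One shape triple shared by every self-attention past_key_values tensor;
# the third (sequence) dimension grows from 0 as decoding proceeds.
self_attention_shapes = {
    "min": (bs, num_heads, 0, head_size),
    "opt": (bs, num_heads, 63, head_size),
    "max": (bs, num_heads, 127, head_size),
}
for i in range(num_layers):
    profile.add(f"past_key_values.{i}.decoder.key", **self_attention_shapes)
    profile.add(f"past_key_values.{i}.decoder.value", **self_attention_shapes)
```

`Profile.add()` returns the profile itself, which is why the removed code can reassign the result of each chained call.
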
diff --git a/demo/HuggingFace/CHANGELOG.md b/demo/HuggingFace/CHANGELOG.md
deleted file mode 100644
index 188e3f45..00000000
--- a/demo/HuggingFace/CHANGELOG.md
+++ /dev/null
@@ -1,78 +0,0 @@
-# HF-OSS Demo changelog
-
-Uses [changelog conventions](https://keepachangelog.com/en/1.0.0/).
-Uses [semantic versioning](https://semver.org/).
-
-## Guiding Principles
-- Changelogs are for humans, not machines.
-- There should be an entry for every single version.
-- The same types of changes should be grouped.
-- Versions and sections should be linkable.
-- The latest version comes first.
-- The release date of each version is displayed.
-- Mention whether you follow Semantic Versioning.
-
-## Types of changes
-- `Added` for new features.
-- `Changed` for changes in existing functionality.
-- `Deprecated` for soon-to-be removed features.
-- `Removed` for now removed features.
-- `Fixed` for any bug fixes.
-- `Security` in case of vulnerabilities.
-
-# [1.3.4] - 2023-02-02
-- Changed GPT2 demo kv cache TRT to 1 engine, 2 optimization profiles
-- Added fp16 support for GPT2
-
-# [1.3.3] - 2023-01-04
-- Deprecated the max workspace size flag in favor of memory pool limits for TensorRT
-- Added t5-11b support
-- Changed T5 demo kv cache TRT memory organization to avoid D2D copy
-
-# [1.3.2] - 2022-11-17
-- Added beam search support for GPT2 demo
-- Added KV cache support for GPT2 demo
-- Fixed perplexity calculation array size exceeding max_length
-- Fixed trt KV cache engine profile to only accept input_length = 1
-- Fixed external onnx weight file name overwrite issue
-
-# [1.3.1] - 2022-11-04
-- Added beam search support for T5 demo
-- Added KV cache support for T5 demo
-
-# [1.3.0] - 2022-11-03
-- Added perplexity calculation for all samples
-- Added precision override to checkpoints.
-- Fixed TensorRT BART checkpoint not working.
-
-# [1.2.5] - 2022-10-08
-- Added beam search support for BART
-
-# [1.2.4] - 2022-09-30
-- Added notebooks for BART demo
-- Enabled flexible control over (a) percentile latency reports and (b) engine-building profiles other than the standard maximum input/output length config
-
-# [1.2.3] - 2022-06-30
-- Added KV cache support for BART demo
-
-# [1.2.2] - 2022-06-14
-- Added BART demo
-
-# [1.2.1] - 2022-05-20
-
-- Added `benchmark` action to T5 frameworks/onnxrt and GPT2 frameworks/trt for performance benchmarking. It uses random
- inputs with fixed lengths and disables early stopping such that we can compare the performance with other frameworks.
-- Added `batch_size > 1` support to GPT2 trt sample.
-
-# [1.2.0] - 2022-03-29
-
-- Added `benchmark` action to T5 trt for performance benchmarking. It uses random inputs with fixed lengths and disables
- early stopping such that we can compare the performance with other frameworks.
-
-# [1.1.0] - 2022-02-09
-
-- Added `-o` or `--save-output-fpath` which saves a pickled version of the `NetworkResult` object. Useful for testing.
-
-# [1.0.0] - 2022
-
-- Added initial working example of HF samples and notebooks.
diff --git a/demo/HuggingFace/GPT2/GPT2ModelConfig.py b/demo/HuggingFace/GPT2/GPT2ModelConfig.py
deleted file mode 100644
index a0edca9f..00000000
--- a/demo/HuggingFace/GPT2/GPT2ModelConfig.py
+++ /dev/null
@@ -1,198 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import argparse
-
-from collections import namedtuple, OrderedDict
-from itertools import product
-from typing import Dict
-
-# TRT-HuggingFace
-from NNDF.networks import Precision, NetworkMetadata, NNConfig, Dims
-from NNDF.interface import MetadataArgparseInteropMixin
-
-# Limitation of namedtuples. You must declare namedtuples in module scope and not in classes.
-# Otherwise pickle doesn't work.
-# See: https://stackoverflow.com/questions/4677012/python-cant-pickle-type-x-attribute-lookup-failed
-_GPT2Metadata = namedtuple("GPT2Metadata", ["kv_cache"])
-
-
-class GPT2Metadata(_GPT2Metadata, MetadataArgparseInteropMixin):
- @staticmethod
- def add_args(parser: argparse.ArgumentParser) -> None:
- """Add commandline interface parser."""
- network_group = parser.add_argument_group("GPT2 network")
- network_group.add_argument(
- "--variant",
- help="GPT2 variant to generate",
- choices=GPT2ModelTRTConfig.TARGET_MODELS,
- required=True,
- )
- network_group.add_argument(
- "--enable-kv-cache",
- help="GPT2 enable KV cache",
- action="store_true",
- default=False,
- )
- network_group.add_argument(
- "--num-beams", type=int, default=1, help="Enables beam search during decoding."
- )
-
- network_group.add_argument(
- "--fp16", action="store_true", help="Enables fp16 TensorRT tactics."
- )
-
- @staticmethod
- def from_args(args: argparse.Namespace):
- return NetworkMetadata(
- variant=args.variant,
- precision=Precision(fp16=args.fp16),
- other=GPT2Metadata(kv_cache=args.enable_kv_cache),
- )
-
- @staticmethod
- def add_benchmarking_args(parser: argparse.ArgumentParser) -> None:
- benchmarking_group = parser.add_argument_group("benchmarking group")
- benchmarking_group.add_argument(
- "--input-seq-len",
- type=int,
- help="Specify fixed input sequence length for perf benchmarking.",
- )
- benchmarking_group.add_argument(
- "--output-seq-len",
- type=int,
- help="Specify fixed output sequence length for perf benchmarking.",
- )
-
-
-GPT2BenchmarkingArgs = namedtuple("GPT2BenchmarkingArgs", ["input_seq_len", "output_seq_len"])
-GPT2TRTBenchmarkingArgs = namedtuple("GPT2BenchmarkingArgs", ["input_seq_len", "output_seq_len", "input_profile_max_len", "output_profile_max_len"])
-
-
-class GPT2ModelTRTConfig(NNConfig):
- TARGET_MODELS = ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "EleutherAI/gpt-j-6B"]
- NETWORK_DECODER_SEGMENT_NAME = "gpt2_decoder"
- NETWORK_SEGMENTS = [NETWORK_DECODER_SEGMENT_NAME]
- NETWORK_FULL_NAME = "full"
-
- NUMBER_OF_LAYERS = {
- TARGET_MODELS[0]: 12,
- TARGET_MODELS[1]: 24,
- TARGET_MODELS[2]: 36,
- TARGET_MODELS[3]: 48,
- TARGET_MODELS[4]: 28,
- }
-
- # This corresponds to max_length in task_specific_params for text-generation.
-    # Neither the input nor the output length should exceed 50.
- MAX_LENGTH = {
- TARGET_MODELS[0]: 50,
- TARGET_MODELS[1]: 50,
- TARGET_MODELS[2]: 50,
- TARGET_MODELS[3]: 50,
- TARGET_MODELS[4]: 50,
- }
-
- MIN_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 0,
- TARGET_MODELS[1]: 0,
- TARGET_MODELS[2]: 0,
- TARGET_MODELS[3]: 0,
- TARGET_MODELS[4]: 0,
- }
-
- def __init__(self):
- precision_fp16 = [False, True]
- kv_caches = [False, True]
- variants = []
- for variant, fp16, kv_cache in product(
- GPT2ModelTRTConfig.TARGET_MODELS, precision_fp16, kv_caches
- ):
- variants.append(
- NetworkMetadata(
- variant=variant,
- precision=Precision(fp16=fp16),
- other=GPT2Metadata(kv_cache=kv_cache),
- )
- )
-
- super().__init__("GPT2", variants=variants)
-
- def get_python_requirements(self):
- base_requirements = super().get_python_requirements()
- base_requirements.append('transformers==4.20.0; python_version>="3.7"')
- base_requirements.append('transformers==4.18.0; python_version<"3.7"')
- return base_requirements
-
- def get_metadata_string(self, metadata: NetworkMetadata) -> str:
- # Remove redundant GPT2 name
- metadata = metadata._replace(variant=metadata.variant.lstrip("GPT2-"))
- metadata = metadata._replace(variant=metadata.variant.lstrip("EleutherAI/"))
- return super().get_metadata_string(metadata)
-
- @staticmethod
- def get_input_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of input dimensions.
- Returns:
- (Dict[str, Dims]): {"decoder": Dims}
- """
- decoder_inputs_dict = OrderedDict({"input_ids": (Dims.BATCH, Dims.SEQUENCE)})
- if metadata.other.kv_cache:
-            # For the KV cache version, we need to add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V)
- for i in range(GPT2ModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("past_decoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.decoder.key"] = self_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.decoder.value"] = self_attention_past_kv_dims
-
- decoder_inputs = Dims(decoder_inputs_dict)
-
- return {
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_inputs
- }
-
- @staticmethod
- def get_output_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of output dimensions.
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- decoder_outputs_dict = OrderedDict(
- {
- "logits": (
- Dims.BATCH,
- Dims.SEQUENCE,
- "vocab_size"
- )
- }
- )
- if metadata.other.kv_cache:
-            # For the KV cache version, we need to add per-layer KV cache outputs. `present_key_values` at each layer is (self-attention K, self-attention V)
- for i in range(GPT2ModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("decoder_length"), "embedding_size_per_head")
- decoder_outputs_dict[f"present_key_values.{i}.decoder.key"] = self_attention_present_kv_dims
- decoder_outputs_dict[f"present_key_values.{i}.decoder.value"] = self_attention_present_kv_dims
-
- decoder_outputs = Dims(decoder_outputs_dict)
-
- return {
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_outputs
- }
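
The dimension encodings returned by `get_input_dims()` and `get_output_dims()` are ultimately turned into dynamic-axes declarations for the ONNX exporter: the batch and (past-)sequence dimensions vary, while `num_heads` and `embedding_size_per_head` stay fixed. A rough equivalent written directly against `torch.onnx.export`'s `dynamic_axes` argument might look as follows; the layer count is an illustrative placeholder for `NUMBER_OF_LAYERS[variant]`.

```python
# Sketch: how the per-layer KV-cache tensor naming maps onto ONNX dynamic axes.
# num_layers is a placeholder; the demo reads it from NUMBER_OF_LAYERS[variant].
num_layers = 2

dynamic_axes = {
    "input_ids": {0: "batch", 1: "sequence"},
    "logits": {0: "batch", 1: "sequence"},
}
for i in range(num_layers):
    for kv in ("key", "value"):
        # dim 0 (batch) and dim 2 (decoded length so far) are dynamic.
        dynamic_axes[f"past_key_values.{i}.decoder.{kv}"] = {0: "batch", 2: "past_decoder_length"}
        dynamic_axes[f"present_key_values.{i}.decoder.{kv}"] = {0: "batch", 2: "decoder_length"}
```
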
diff --git a/demo/HuggingFace/GPT2/checkpoint.toml b/demo/HuggingFace/GPT2/checkpoint.toml
deleted file mode 100644
index 4815250f..00000000
--- a/demo/HuggingFace/GPT2/checkpoint.toml
+++ /dev/null
@@ -1,108 +0,0 @@
-[GPT2.all.default.all.generate]
-
-input = '''
-TensorRT is a Deep Learning compiler used for deep learning.
-'''
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nThe main goal of the project is to create a tool that can be used to train neural networks.\n\nThe main goal of the project is to create a tool that can
-'''
-
-[GPT2.all.gpt2-medium.all.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nTensorRT is a Deep Learning compiler used for deep learning. TensorRT is a deep learning library for Python.\n\nTensorRT is a deep learning library for
-'''
-
-[GPT2.all.gpt2-large.all.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nTensorRT is a Deep Learning compiler used for deep learning. TensorFlow is a high-performance, open-source, cross-platform, high-performance, machine
-'''
-
-[GPT2.all.gpt2-xl.all.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nThe library is written in C++ and uses Boost.Python.\n\nThe library is available on GitHub.\n\nInstallation\n\nThe library is available on GitHub.\n
-'''
-
-[GPT2.all."EleutherAI/gpt-j-6B".all.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nTensorRT is a deep learning compiler that enables you to run deep learning models on NVIDIA GPUs.\n\nTensorRT is a deep learning compiler that enables you to run
-'''
-
-[GPT2.all.default.fp16.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nThe main goal of the project is to provide a way to build a deep learning framework that can be used to build a deep learning framework for a wide range of applications.\n
-'''
-
-[GPT2.all.default.all.generate_b]
-
-input = '''
-GPT-2 is a transformer based model pretrained on a large corpus.
-'''
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\nThe model is based on the following assumptions:\n\nThe model is based on the following assumptions:\n\nThe model is based on the following assumptions:\n
-'''
-
-[GPT2.all.gpt2-medium.all.generate_b]
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\nThe model is trained on a large corpus of data, and the model is trained on a large number of training examples. The model is trained on a large number
-'''
-
-[GPT2.all.gpt2-large.all.generate_b]
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\nThe model is trained on the following data:\n\nThe corpus consists of the following text files:\n\nThe corpus is split into two parts:\n\n
-'''
-
-[GPT2.all.gpt2-xl.all.generate_b]
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\nThe model is trained on the MNIST dataset, which contains over 100,000 handwritten digits. The training data is split into two parts: the training set and
-'''
-
-[GPT2.all."EleutherAI/gpt-j-6B".all.generate_b]
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\n- **GPT-2-PT**: The same as GPT-2 but with the pretrained model.\n\n- **
-'''
-
-[GPT2.all.default.all.generate_c]
-
-input = '''
-If I fall asleep then I am going to wake up in 8 hours.
-'''
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am not going to sleep for 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours
-'''
-
-[GPT2.all.gpt2-medium.all.generate_c]
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours
-'''
-
-[GPT2.all.gpt2-large.all.generate_c]
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours
-'''
-
-[GPT2.all.gpt2-xl.all.generate_c]
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours
-'''
-
-[GPT2.all."EleutherAI/gpt-j-6B".all.generate_c]
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am going to be in the same place.\n\nI am going to be in the same place.\n\nI am going to be in the same place
-'''
-
diff --git a/demo/HuggingFace/GPT2/export.py b/demo/HuggingFace/GPT2/export.py
deleted file mode 100644
index cbd06964..00000000
--- a/demo/HuggingFace/GPT2/export.py
+++ /dev/null
@@ -1,258 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Contains logic that captures GPT2 HuggingFace models into ONNX models and TRT engines.
-"""
-
-from itertools import tee
-import os
-from collections import OrderedDict
-
-# tensorrt
-import tensorrt as trt
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# torch
-import torch
-from torch.nn import Module
-
-# # huggingface
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import CausalLMOutputWithPast
-from transformers import GPT2Tokenizer
-
-# TRT-HuggingFace
-from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig
-from NNDF.networks import NetworkMetadata, Dims
-from NNDF.logger import G_LOGGER
-from NNDF.models import (
- TRTEngineFile,
- TorchModelFile,
- ONNXModelFile,
- ModelFileConverter,
-)
-
-class GPT2TorchFile(TorchModelFile):
- class TorchModule(Module, GenerationMixin):
- """
-        A simplified definition of GPT2 with an LM head.
- """
-
- def __init__(self, transformer, lm_head, config):
- super().__init__()
- self.transformer = transformer
- self.lm_head = lm_head
- self.config = config
- self.device = torch.device('cuda') # WAR to avoid beam search in framework
- self.main_input_name = "input_ids" # For better HuggingFace version compatibility
-
- def prepare_inputs_for_generation(self, input_ids, past = None, use_cache=None, **kwargs):
- # Todo (@pchadha): add position_ids, token_type_ids support
- # cut decoder_input_ids if past is used
- if past is not None:
- input_ids = input_ids[:, -1:]
-
- return {
- "input_ids": input_ids,
- "use_cache": use_cache,
- "past_key_values": past
- }
-
- def forward(self, input_ids, **kwargs):
- transformer_outputs = self.transformer(input_ids, **kwargs)
- hidden_states = transformer_outputs[0]
- lm_logits = self.lm_head(hidden_states)
-
- return CausalLMOutputWithPast(
- logits=lm_logits,
- past_key_values=transformer_outputs.past_key_values
- )
-
- def _reorder_cache(self, past, beam_idx):
- """
- This function is used to re-order the :obj:`past_key_values` cache if
- :meth:`~transformers.PreTrainedModel.beam_search` or :meth:`~transformers.PreTrainedModel.beam_sample` is
- called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step.
- """
- return tuple(
- tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)
- for layer_past in past
- )
-
- def __call__(self, *args, **kwargs):
- return self.forward(*args, **kwargs)
-
- def __init__(self, model, network_metadata):
- super().__init__(model, GPT2Converter, network_metadata)
-
-
-class GPT2ONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, GPT2Converter, network_metadata)
-
-
-# TRT Engine File Encoding #
-class GPT2TRTEngine(TRTEngineFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, GPT2Converter, network_metadata)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
- def get_network_definition(self, network_definition):
-
- def pairwise(iterable):
- a, b = tee(iterable)
- next(b, None)
- return zip(a, b)
-
- indices = list(range(0, network_definition[1].num_layers))
- for i, i_next in pairwise(indices):
- l = network_definition[1].get_layer(i)
- l_next = network_definition[1].get_layer(i_next)
-
- if not all([l.get_output(i).is_execution_tensor for i in range(l.num_outputs)]):
- continue
-
- if l.get_output_type(0) != trt.float32:
- continue
-
- if l.type == trt.LayerType.ELEMENTWISE and l_next.type == trt.LayerType.REDUCE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.POW:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- l_next.precision = trt.float32
- l_next.set_output_type(0, trt.float32)
-
- if self.network_metadata.precision.fp16:
- for i in range(network_definition[1].num_inputs):
- t = network_definition[1].get_input(i)
- if t.dtype == trt.float32:
- t.dtype = trt.float16
-
- for i in range(network_definition[1].num_outputs):
- t = network_definition[1].get_output(i)
- if t.dtype == trt.float32:
- t.dtype = trt.float16
-
- return network_definition
-
-# Converters
-class GPT2Converter(ModelFileConverter):
- def __init__(self):
- super().__init__(GPT2TorchFile, GPT2ONNXFile, GPT2TRTEngine)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a GPT2LMHead model to ONNX.
-
- Args:
-            output_fpath (str): Path to the ONNX file
-            model (torch.nn.Module): Loaded torch model
-
- Returns:
- GPT2ONNXFile: ONNX GPT2 decoder object.
- """
- # Currently does not support exporting GPU models to onnx.
- device = model.device
- tokenizer = GPT2Tokenizer.from_pretrained(network_metadata.variant)
- input_ids = torch.tensor(
- [
- tokenizer.encode(
- "Here is some text to encode Hello World", add_special_tokens=True
- )
- ]
- ).to(device)
-
- gpt2_model = GPT2TorchFile.TorchModule(
- model.transformer, model.lm_head, model.config
- )
-
- inputs = GPT2ModelTRTConfig.get_input_dims(network_metadata)[
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ]
- outputs = GPT2ModelTRTConfig.get_output_dims(network_metadata)[
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
- if not network_metadata.other.kv_cache:
-            # This wrapper lets the HuggingFace-compatible torch class work with the ONNX exporter.
-            # It restricts the number of outputs to 1 when non-KV-cache mode is used;
-            # otherwise the model would automatically output key/value pairs as well.
- old_forward = gpt2_model.forward
- def _export_forward(input_ids, **kwargs):
- result = old_forward(input_ids, use_cache = False, **kwargs)
- return result[0]
- gpt2_model.forward = _export_forward
-
- torch.onnx.export(
- gpt2_model,
- input_ids,
- output_fpath,
- opset_version=13,
- do_constant_folding=True,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
- else:
- decoder_output = gpt2_model(input_ids, use_cache = True)
- past_key_values = decoder_output[1]
-
- # Exporting the kv cache engine
- old_forward = gpt2_model.forward
- def _export_forward(input_ids, past_key_values, **kwargs):
- result = old_forward(input_ids, past_key_values=past_key_values, use_cache=True, **kwargs)
- return (result[0], result[1])
- gpt2_model.forward = _export_forward
-
- torch.onnx.export(
- gpt2_model,
- (input_ids, past_key_values),
- output_fpath,
- opset_version=13,
- do_constant_folding=True,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- return GPT2ONNXFile(output_fpath, network_metadata)
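
The exporter above works by temporarily replacing the model's `forward` with a thin wrapper so that the traced graph has a fixed output signature: logits only in non-KV-cache mode, logits plus present key/values in KV-cache mode. A generic sketch of that wrapping trick for the non-KV-cache path is shown below; the helper name and argument names are illustrative and not part of the demo's API.

```python
# Sketch of the forward-wrapping trick used above, for the non-KV-cache export path.
# `model` is any HuggingFace-style causal-LM module; names here are illustrative.
import torch

def export_without_cache(model, sample_input_ids, output_fpath):
    original_forward = model.forward

    def _export_forward(input_ids, **kwargs):
        # Force use_cache=False and keep only the logits so the exported
        # ONNX graph has a single, stable output.
        return original_forward(input_ids, use_cache=False, **kwargs)[0]

    model.forward = _export_forward
    try:
        torch.onnx.export(
            model,
            sample_input_ids,
            output_fpath,
            opset_version=13,
            do_constant_folding=True,
            input_names=["input_ids"],
            output_names=["logits"],
            dynamic_axes={
                "input_ids": {0: "batch", 1: "sequence"},
                "logits": {0: "batch", 1: "sequence"},
            },
        )
    finally:
        model.forward = original_forward  # restore the real forward afterwards
```
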
diff --git a/demo/HuggingFace/GPT2/frameworks.py b/demo/HuggingFace/GPT2/frameworks.py
deleted file mode 100644
index d430a056..00000000
--- a/demo/HuggingFace/GPT2/frameworks.py
+++ /dev/null
@@ -1,318 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-import argparse
-
-from typing import List, Union
-
-# huggingface
-from transformers import (
- AutoConfig,
- AutoModelForCausalLM,
- # GPT-J uses GPT2 tokenizer
- GPT2Tokenizer,
-)
-
-# torch
-import torch
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# helpers
-from NNDF.interface import FrameworkCommand
-from NNDF.general_utils import confirm_folder_delete, NNFolderWorkspace
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkResult,
- NetworkMetadata,
- NetworkRuntime,
- Precision,
- NetworkModel,
- NetworkModels,
- TimingProfile,
-)
-from GPT2.export import GPT2TorchFile
-from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig, GPT2BenchmarkingArgs
-from GPT2.measurements import gpt2_inference, full_inference, calculate_perplexity
-
-
-class GPT2HuggingFace(FrameworkCommand):
- def __init__(self):
- super().__init__(
- GPT2ModelTRTConfig, description="Runs framework results for GPT2 model."
- )
-
- # Default inference input used during inference stage
- self.onnx_gpt2 = None
- self.torch_gpt2_dir = None
-
- def generate_and_download_framework(
- self, metadata: NetworkMetadata, workspace: NNFolderWorkspace
- ) -> NetworkModels:
-
- trt_gpt2_config = self.config
- metadata_serialized = trt_gpt2_config.get_metadata_string(metadata)
- workspace_dir, _ , onnx_root = workspace.set_model_path(metadata_serialized, is_encoder_decoder = False)
- pytorch_model_dir = os.path.join(workspace_dir, "pytorch_model")
- # We keep track of the generated torch location for cleanup later
- self.torch_gpt2_dir = pytorch_model_dir
-
- if not os.path.exists(pytorch_model_dir):
- # Generate the pre-trained weights
- model = AutoModelForCausalLM.from_pretrained(metadata.variant, use_cache = metadata.other.kv_cache)
- model.save_pretrained(pytorch_model_dir)
- print("Pytorch Model saved to {}".format(pytorch_model_dir))
- else:
- print(
- "Frameworks file already exists, skipping generation and loading from file instead."
- )
- model = AutoModelForCausalLM.from_pretrained(pytorch_model_dir)
-
- onnx_model_fpath = os.path.join(onnx_root, metadata_serialized + ".onnx")
-
- gpt2 = GPT2TorchFile(model, metadata)
- self.onnx_gpt2 = gpt2.as_onnx_model(onnx_model_fpath, force_overwrite=False)
-
- onnx_models = [
- NetworkModel(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.onnx_gpt2.fpath,
- )
- ]
- torch_models = [
- NetworkModel(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=pytorch_model_dir,
- )
- ]
-
- return NetworkModels(torch=torch_models, onnx=onnx_models, trt=None)
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- save_onnx_model: bool = True,
- keep_pytorch_model: bool = True,
- ) -> None:
- """
-        Cleans up the working directory, leaving the models in place if requested.
-        Should not assume that any functions from the framework class have been called.
- Returns:
- None
- """
- # Clean-up generated files
- if not save_onnx_model and self.onnx_gpt2 is not None:
- self.onnx_gpt2.cleanup()
-
- if not keep_pytorch_model:
- # Using rmtree can be dangerous, have user confirm before deleting.
- confirm_folder_delete(
- self.torch_gpt2_dir,
- prompt="Confirm you want to delete downloaded pytorch model folder?",
- )
-
- if not keep_pytorch_model and not save_onnx_model:
- workspace.cleanup(force_remove=False)
-
- def setup_tokenizer_and_model(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- ):
- tokenizer = GPT2Tokenizer.from_pretrained(metadata.variant)
-
-        # GPT2 has no pad token set by default. Use a custom token; only "generate()" will
-        # automatically replace it with the EOS token when running in generation mode.
- tokenizer.add_special_tokens({"pad_token": "[PAD]"})
-
- # By default, HuggingFace model structure is one giant file.
- gpt2_torch_fpath = network_fpaths.torch[0].fpath
- gpt2_model = AutoModelForCausalLM.from_pretrained(gpt2_torch_fpath)
-
- # Framework fp16 does not support cpu mode for GPT2
- if metadata.precision.fp16:
- gpt2_model = gpt2_model.cuda().half()
-
- gpt2_torch = GPT2TorchFile.TorchModule(
- gpt2_model.transformer, gpt2_model.lm_head, gpt2_model.config
- )
-
- return tokenizer, gpt2_torch
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- inference_input: str,
- timing_profile: TimingProfile,
- use_cpu: bool,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: GPT2BenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer, gpt2_torch = self.setup_tokenizer_and_model(metadata, network_fpaths)
- config = gpt2_torch.config
-        # Prepare the input tokens and determine the output sequence length.
- if not benchmarking_mode:
- output_seq_len = GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- input_seq_len = benchmarking_args.input_seq_len
- output_seq_len = benchmarking_args.output_seq_len
- input_ids = torch.randint(0, config.vocab_size, (batch_size, input_seq_len))
-
- # get single decoder iteration inference timing profile
- _, decoder_e2e_time = gpt2_inference(
- gpt2_torch,
- input_ids,
- timing_profile,
- use_cuda=(not use_cpu),
- use_cache = metadata.other.kv_cache,
- )
-
- # get complete decoder inference result and its timing profile
- sample_output, full_e2e_runtime = full_inference(
- gpt2_torch,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=GPT2ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=(not use_cpu),
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- num_beams=num_beams
- )
-
- # Prepare runtime results.
- runtime = [
- NetworkRuntime(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=GPT2ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=network_fpaths)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- sample_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=sample_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=network_fpaths,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- reference: str,
- ):
- tokenizer, gpt2_torch = self.setup_tokenizer_and_model(metadata, network_fpaths)
- reference = reference.replace("\\n", "\n")
- ppl_input_ids = tokenizer([reference], padding=True, return_tensors="pt").input_ids
- perplexity = calculate_perplexity(
- gpt2_torch, ppl_input_ids, GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- )
-
- return perplexity
-
- def run_framework(
- self,
- metadata: NetworkMetadata,
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_pytorch_model: bool,
- timing_profile: TimingProfile,
- use_cpu: bool = False,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
-
- """
-        Main entry point: generates the framework models and runs inference over them.
- """
- inference_results = []
- ppl_results = []
- workspace = NNFolderWorkspace(
- self.config.network_name, metadata, working_directory
- )
- try:
- network_fpaths = self.generate_and_download_framework(metadata, workspace)
- if not benchmarking_mode:
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, network_fpaths, ninput, timing_profile, use_cpu, batch_size, args.num_beams
- )
- )
- if perplexity_reference is not None:
- for r in perplexity_reference:
- ppl_results.append(
- self.execute_calculate_perplexity(
- metadata, network_fpaths, r
- )
- )
- else:
- benchmarking_args = GPT2BenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- inference_results = self.execute_inference(
- metadata, network_fpaths, None, timing_profile, use_cpu, batch_size, args.num_beams, True, benchmarking_args
- )
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_pytorch_model)
-
- return inference_results, ppl_results
-
- def args_to_network_metadata(self, args: argparse.Namespace) -> NetworkMetadata:
- return NetworkMetadata(
- variant=args.variant,
- precision=Precision(fp16=args.fp16),
- other=self.config.MetadataClass(kv_cache=args.enable_kv_cache),
- )
-
-
-# Entry point
-RUN_CMD = GPT2HuggingFace()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/GPT2/measurements.py b/demo/HuggingFace/GPT2/measurements.py
deleted file mode 100644
index f783f872..00000000
--- a/demo/HuggingFace/GPT2/measurements.py
+++ /dev/null
@@ -1,99 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Utils specific to GPT2 network.
-"""
-
-# torch
-import torch
-
-
-# from HuggingFace transformers
-from transformers.generation_logits_process import (
- MinLengthLogitsProcessor,
- LogitsProcessorList,
- ForcedEOSTokenLogitsProcessor,
-)
-from transformers.generation_stopping_criteria import (
- MaxLengthCriteria,
- StoppingCriteriaList,
-)
-
-# TRT-HuggingFace
-from NNDF.general_utils import measure_python_inference_code
-from NNDF.torch_utils import use_cuda
-from NNDF.tensorrt_utils import TRTNativeRunner
-
-@use_cuda
-def gpt2_inference(gpt2, input_ids, timing_profile, use_cuda=True, use_cache=False, past_key_values = None):
- gpt2_stmt = lambda: gpt2(input_ids=input_ids, use_cache=use_cache, past_key_values=past_key_values)
- gpt2_e2e_time = measure_python_inference_code(gpt2_stmt, timing_profile)
- return (gpt2_stmt(), gpt2_e2e_time)
-
-
-# Code specifically for Pythonic inference measurement used across all GPT2 related scripts
-@use_cuda
-def full_inference(
- gpt2,
- input_ids,
- tokenizer,
- timing_profile,
- max_length,
- min_length = 0,
- use_cuda=True,
- batch_size=1,
- early_stopping=False,
- use_cache=False,
- num_beams = 1,
-):
-
- if isinstance(gpt2, TRTNativeRunner):
- gpt2.set_return_device("cuda" if use_cuda else "cpu")
-
- def _e2e():
- with torch.no_grad():
- output = gpt2.generate(
- input_ids,
- max_length=max_length,
- min_length=min_length,
- batch_size=batch_size,
- num_beams=num_beams,
- use_cache=use_cache,
- early_stopping=early_stopping
- )
-
- return output
-
- full_e2e_time = measure_python_inference_code(_e2e, timing_profile)
- return (_e2e(), full_e2e_time)
-
-
-@use_cuda
-def calculate_perplexity(gpt2, input_ids, max_seq_len=None, use_cuda=True):
- if isinstance(gpt2, TRTNativeRunner):
- gpt2.set_return_device("cuda" if use_cuda else "cpu")
-
- with torch.no_grad():
- if max_seq_len is not None:
- input_ids = input_ids[:, :max_seq_len]
- logits = gpt2(input_ids).logits
- # Shift logits and target ids so that probabilities generated by token < n line up with output token n.
- shifted_logits = logits[:, :-1, :]
- target_ids = input_ids[:, 1:]
- loss = torch.nn.CrossEntropyLoss()(shifted_logits.permute((0, 2, 1)), target_ids)
- return torch.exp(loss).item()
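
The perplexity helper above relies on a one-token shift: the logits produced at position n-1 are scored against the token at position n, and perplexity is the exponential of the mean cross-entropy over those pairs. A tiny self-contained sketch of the same computation, with random tensors standing in for real model outputs, is:

```python
# Sketch of the shifted-logits perplexity computation, using random stand-in data.
import torch

batch, seq_len, vocab = 1, 8, 50257
logits = torch.randn(batch, seq_len, vocab)            # model scores per position
input_ids = torch.randint(0, vocab, (batch, seq_len))

shifted_logits = logits[:, :-1, :]                     # prediction for token n comes from position n-1
target_ids = input_ids[:, 1:]                          # token n is the target
loss = torch.nn.CrossEntropyLoss()(shifted_logits.permute(0, 2, 1), target_ids)
perplexity = torch.exp(loss).item()
```
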
diff --git a/demo/HuggingFace/GPT2/trt.py b/demo/HuggingFace/GPT2/trt.py
deleted file mode 100644
index 411eb72c..00000000
--- a/demo/HuggingFace/GPT2/trt.py
+++ /dev/null
@@ -1,757 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-import copy
-from typing import Dict, List, Tuple, Union
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# numpy
-import numpy as np
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# torch
-import torch
-
-# huggingface
-from transformers import GPT2Tokenizer, AutoConfig
-from transformers.modeling_outputs import CausalLMOutputWithPast
-from transformers.configuration_utils import PretrainedConfig
-from transformers.generation_utils import GenerationMixin
-
-# tensorrt
-from tensorrt import PreviewFeature
-
-# TRT-HuggingFace
-from NNDF.interface import TRTInferenceCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.tensorrt_utils import TRTNativeRunner, TRTPolygraphyRunner, set_kv_data, allocate_binding_buffer, setup_benchmark_arg
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from GPT2.frameworks import GPT2HuggingFace
-from NNDF.general_utils import NNFolderWorkspace
-from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig, GPT2BenchmarkingArgs, GPT2TRTBenchmarkingArgs
-from GPT2.measurements import gpt2_inference, full_inference, calculate_perplexity
-from GPT2.export import GPT2ONNXFile, GPT2TRTEngine
-from NNDF.models import TRTEngineFile
-from NNDF.logger import G_LOGGER
-
-class TRTHFRunner(TRTNativeRunner, GenerationMixin):
-    """Runner that adds interop support for HF and HF-provided greedy_search functions."""
-
- # Stores the encoder input length received at runtime, which is used to slice decoder inputs.
- ENCODER_LENGTH = 0
- def _allocate_memory(self,
- input_shapes: Dict[str, tuple],
- input_types: Dict[str, torch.dtype],
- output_shapes: Dict[str, tuple],
- output_types: Dict[str, torch.dtype]):
- """Helper function for binding several inputs at once and pre-allocating the results."""
- # Allocate memories as 1D linear buffers for simpler handling of dynamic shapes.
- self.inputs = allocate_binding_buffer(input_types, input_shapes)
- self.outputs = allocate_binding_buffer(output_types, output_shapes)
-
- bindings = [None] * self.trt_engine.num_bindings
-
- for input_name, input_array in self.inputs.items():
- # Allocate memory for inputs
- input_idx = self.trt_engine.get_binding_index(input_name)
- self.trt_context.set_binding_shape(input_idx, input_shapes[input_name])
- bindings[input_idx] = input_array.data_ptr()
-
- assert self.trt_context.all_binding_shapes_specified
-
- for output_name, output_array in self.outputs.items():
- # Output shape should be allocated from context size
- output_idx = self.trt_engine.get_binding_index(output_name)
- bindings[output_idx] = output_array.data_ptr()
-
- return bindings
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1
- ):
- super().__init__(trt_engine_file, network_metadata)
- self.config = hf_config
- self.batch_size = batch_size
-
-class GPT2TRTDecoder(TRTHFRunner):
- def __init__(
- self,
- trt_engine_file: str,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_args: GPT2BenchmarkingArgs = None
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
- self.network_metadata = network_metadata
- self.data_type = torch.float32 if not network_metadata.precision.fp16 else torch.float16
-        # In benchmarking mode, if input_profile_max_len is provided, it should be used as the maximum input length.
- if benchmarking_args is not None:
- if benchmarking_args.input_profile_max_len is not None:
- self.max_input_length = benchmarking_args.input_profile_max_len
- else:
- self.max_input_length = hf_config.n_positions
-        # In non-benchmarking mode, we are given a text generation task, so max_length is used as the maximum sequence length.
- else:
- self.max_sequence_length = GPT2ModelTRTConfig.MAX_LENGTH[network_metadata.variant]
-
- # Similarly, the max_output_length should be the user-provided output_profile_max_len if provided
- if benchmarking_args is not None and benchmarking_args.output_profile_max_len is not None:
- self.max_output_length = benchmarking_args.output_profile_max_len
- else:
- self.max_output_length = self.max_sequence_length
-
- self.main_input_name = "input_ids"
- self.num_heads = self.config.n_head
- self.embedding_size_per_head = self.config.n_embd // self.num_heads
- self.num_decoder_layers = self.config.n_layer
-
- self.profile_idx = 0
- self.bindings = [0] * self.trt_engine.num_bindings
- self.logits = torch.zeros((self.batch_size * num_beams, self.max_output_length, hf_config.vocab_size), dtype = self.data_type).cuda()
- self.bindings[self.trt_engine.get_binding_index("logits")] = self.logits.data_ptr()
- # This will be used to calculate the offset for each binding
- self.num_bindings = self.trt_engine.num_bindings // 2 if self.config.use_cache else self.trt_engine.num_bindings
-
- if self.config.use_cache:
- self.bindings[self.trt_engine.get_binding_index("logits") + self.num_bindings] = self.logits.data_ptr()
-
-        # Pointing the input and output bindings at the same buffer does not work for GPT2. Separate caches are needed, and the buffer addresses are swapped after each iteration.
- self.self_attention_cache_1 = {}
- self.self_attention_cache_2 = {}
-
- self_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_output_length - 1, self.embedding_size_per_head)
-
- # Set kv cache shape and type
- for i in range(self.num_decoder_layers):
- for code in ["key", "value"]:
-
- self_attention_name = f"key_values.{i}.decoder.{code}"
- kv_buffer_1 = torch.zeros(self_attention_kv_shape, dtype = self.data_type).cuda()
- kv_buffer_2 = torch.zeros(self_attention_kv_shape, dtype = self.data_type).cuda()
- self.self_attention_cache_1[self_attention_name] = kv_buffer_1
- self.self_attention_cache_2[self_attention_name] = kv_buffer_2
-
- input_idx = self.trt_engine.get_binding_index("past_" + self_attention_name)
- output_idx = self.trt_engine.get_binding_index("present_" + self_attention_name)
-
- self.bindings[input_idx] = kv_buffer_1.data_ptr() # Generation phase
- self.bindings[output_idx] = kv_buffer_2.data_ptr()
-
- # Context mode will always use buffer 1 as output
- self.bindings[input_idx + self.num_bindings] = 0 # Context phase, should be 0
- self.bindings[output_idx + self.num_bindings] = kv_buffer_1.data_ptr()
-
- self.kv_cache_binding_offset = 1 # 0: input_ids, kv cache input indices start from 1
- self.past_decoder_length = 0
- self.use_cache_1_as_input = True
- self._set_context_mode_trt_context()
-
- self.context_mode = self.config.use_cache
- self.return_device = torch.device('cuda')
- self.device = torch.device('cuda')
-
- def reset(self):
- '''
- Resets the input specific fields after finishing a task.
- '''
- self.context_mode = self.config.use_cache
-
- def _switch_input_output_binding(self):
- '''
-        For kv cache mode, switch the input and output pointers to avoid data concurrency issues and D2D copies.
- '''
-        # In context mode the output goes to cache 1; when cache 1 is also used as input, there is no need to switch bindings.
- if not (self.use_cache_1_as_input and self.context_mode):
- for i in range(self.num_decoder_layers):
- for code in ["key", "value"]:
- self_attention_name = f"key_values.{i}.decoder.{code}"
- input_idx = self.trt_engine.get_binding_index("past_" + self_attention_name)
- output_idx = self.trt_engine.get_binding_index("present_" + self_attention_name)
-
- # Switch generation mode kv cache bindings
- temp = self.bindings[output_idx]
- self.bindings[output_idx] = self.bindings[input_idx]
- self.bindings[input_idx] = temp
- self.use_cache_1_as_input = not self.use_cache_1_as_input
-
- def prepare_inputs_for_generation(self, input_ids, past = None, use_cache = None, **kwargs):
- # TODO: add position_ids, token_type_ids support
- if past is not None:
- input_ids = input_ids[:, -1:]
- self.context_mode = False
- else:
- self.context_mode = self.config.use_cache
-
- return {
- "input_ids": input_ids,
- "past_key_values": past,
- "use_cache": use_cache,
- }
-
- def set_return_device(self, return_device):
- """
-        Sets the device that returned tensors are moved to via to(). Device names should match torch device names: cuda, cpu, etc.
- This is used in our measurement code.
- """
- self.return_device = return_device
-
- def _reorder_cache(self, past: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor) -> Tuple[Tuple[torch.Tensor]]:
- """
- This function is used to re-order the :obj:`past_key_values` cache if
- :meth:`~transformers.PreTrainedModel.beam_search` or :meth:`~transformers.PreTrainedModel.beam_sample` is
- called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step.
- """
- return tuple(
- tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)
- for layer_past in past
- )
-
- def _set_context_mode_trt_context(self):
- # Create TRT context for context mode (1st decoder run) with optimization profile = 1
- self.context_trt_context = self.trt_engine.create_execution_context()
- self.context_trt_context.active_optimization_profile = 1
-
-
- def forward(self, input_ids, *args, **kwargs):
- bs = input_ids.shape[0]
- input_length = input_ids.shape[1]
-
-        # Check whether the input data is on the CPU (which usually means PyTorch does not support the current GPU).
- is_cpu_mode = (input_ids.device == torch.device("cpu")) or (self.return_device == "cpu")
-
- if is_cpu_mode:
- input_ids = input_ids.int().cuda()
-
- # Set the binding shape of input_ids, which should be (bs, input_length).
- if not self.context_mode:
- self.bindings[0] = input_ids.int().data_ptr()
- self.trt_context.set_binding_shape(0, input_ids.shape)
- else:
- self.bindings[self.num_bindings] = input_ids.int().data_ptr()
- self.context_trt_context.set_binding_shape(self.num_bindings, input_ids.shape)
-
- if self.config.use_cache:
- if self.context_mode:
- self.past_decoder_length = 0
-
- self_attention_kv_shape = (bs, self.num_heads, self.past_decoder_length, self.embedding_size_per_head)
-
- for i in range(self.num_decoder_layers):
- if not self.context_mode:
-                    # Optimization Profile 0 is the generation phase, which takes KV cache inputs
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+2*i, self_attention_kv_shape)
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+2*i + 1, self_attention_kv_shape)
- else:
-                    # Optimization Profile 1 is the context phase, which takes zero-length KV cache inputs
- self.context_trt_context.set_binding_shape(self.kv_cache_binding_offset+2*i + self.num_bindings, self_attention_kv_shape)
- self.context_trt_context.set_binding_shape(self.kv_cache_binding_offset+2*i + 1 + self.num_bindings, self_attention_kv_shape)
-
- # Launch TRT inference.
- if not self.context_mode:
- assert self.trt_context.all_binding_shapes_specified
- self.trt_context.execute_v2(bindings=self.bindings)
- else:
- assert self.context_trt_context.all_binding_shapes_specified
- self.context_trt_context.execute_v2(bindings=self.bindings)
-
-        # For bs > 1, this is required, so this D2D copy cannot be avoided.
- logits_length = bs * input_length * self.config.vocab_size
- logits = self.logits.flatten()[:logits_length].view(bs, input_length, self.config.vocab_size)
-
- if is_cpu_mode:
- logits = logits.cpu()
-
- present_key_values = None
- if self.config.use_cache:
- self.past_decoder_length += input_length
-
- present_key_values = ()
- self_attention_cache = self.self_attention_cache_1 if self.use_cache_1_as_input or (self.profile_idx == 0) else self.self_attention_cache_2
-
- for i in range(self.num_decoder_layers):
-
- self_attention_k_output = self_attention_cache[f"key_values.{i}.decoder.key"]
- self_attention_v_output = self_attention_cache[f"key_values.{i}.decoder.value"]
-
- if is_cpu_mode:
- self_attention_k_output = self_attention_k_output.cpu()
- self_attention_v_output = self_attention_v_output.cpu()
-
- present_key_values += ((self_attention_k_output, self_attention_v_output),)
-
- self._switch_input_output_binding()
- return CausalLMOutputWithPast(logits=logits.to(self.return_device), past_key_values = present_key_values)
-
-class GPT2TRT(TRTInferenceCommand):
- def __init__(self):
- super().__init__(
- GPT2ModelTRTConfig, "Runs polygraphy results for GPT2 model.", GPT2HuggingFace
- )
- self.gpt2_trt = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_trt_engine: bool = False,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.gpt2_trt is not None:
- self.gpt2_trt.release()
-
- if not keep_trt_engine:
- self.gpt2_trt_engine.cleanup()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def generate(
- self,
- input_ids,
- min_length: int = None,
- max_length: int = None,
- num_beams: int = 1,
- use_cache: bool = False,
- early_stopping: bool = True,
- ):
- if max_length is None:
- max_length = GPT2ModelTRTConfig.MAX_OUTPUT_LENGTH[self.metadata.variant]
-
- if min_length is None:
- min_length = GPT2ModelTRTConfig.MIN_OUTPUT_LENGTH[self.metadata.variant]
-
- output = self.gpt2_trt.generate(
- input_ids,
- max_length=max_length,
- min_length=min_length,
- num_beams=num_beams,
- use_cache=use_cache,
- early_stopping=early_stopping
- )
-
- self.gpt2_trt.reset()
- return output
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: GPT2TRTBenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer = GPT2Tokenizer.from_pretrained(metadata.variant)
-
-        # GPT2 does not have a pad token set by default, so add a custom one.
-        # Only generate() will automatically replace it with the EOS token in generation mode.
- tokenizer.add_special_tokens({"pad_token": "[PAD]"})
- hf_config = self.gpt2_trt.config
-
- # Prepare the input tokens and find out output sequence length.
- if not benchmarking_mode:
- output_seq_len = GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, return_tensors="pt").input_ids
- else:
- input_seq_len = benchmarking_args.input_seq_len
- output_seq_len = benchmarking_args.output_seq_len
- input_ids = torch.randint(0, hf_config.vocab_size, (batch_size, input_seq_len))
-
- # get single decoder iteration inference timing profile
- _, decoder_e2e_time = gpt2_inference(
- self.gpt2_trt,
- expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids,
- timing_profile,
- use_cache = metadata.other.kv_cache,
- )
-
- # get complete decoder inference result and its timing profile
- sample_output, full_e2e_runtime = full_inference(
- self.gpt2_trt,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=GPT2ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- num_beams=num_beams,
- )
-
- # Prepare runtime results.
- runtime = [
- NetworkRuntime(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=GPT2ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models = NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=[
- NetworkModel(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.gpt2_trt_engine.fpath,
- ),
- ],
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- sample_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=sample_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- reference: str,
- batch_size: int,
- ):
- tokenizer = GPT2Tokenizer.from_pretrained(metadata.variant)
-
-        # GPT2 does not have a pad token set by default, so add a custom one.
-        # Only generate() will automatically replace it with the EOS token in generation mode.
- tokenizer.add_special_tokens({"pad_token": "[PAD]"})
- reference = reference.replace("\\n", "\n")
- ppl_input_ids = tokenizer([reference] * batch_size, padding=False, return_tensors="pt").input_ids
-
- perplexity = calculate_perplexity(
- self.gpt2_trt, ppl_input_ids, GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- )
- return perplexity
-
- def _setup_engines(
- self,
- metadata: NetworkMetadata,
- hash_onnx_fpath: Dict[str, NetworkModel],
- batch_size: int,
- num_beams: int,
- disable_preview_dynamic_shapes: bool,
- benchmarking_args: GPT2TRTBenchmarkingArgs = None,
- seq_tag: bool = False, # whether the benchmark engine tag format should be seq or max
- ) -> None:
-
- hf_config = AutoConfig.from_pretrained(
- metadata.variant,
- use_cache=metadata.other.kv_cache
- )
-
-        # The number of exported ONNX segments must match the network segments defined by the configuration file.
- assert len(hash_onnx_fpath) == len(
- GPT2ModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in GPT2 model.".format(
- len(GPT2ModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- decoder_onnx_fpath = hash_onnx_fpath[
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ].fpath
-
- # Generate optimization profiles.
-        # Non-benchmarking mode: the opt profile length defaults to half of the max profile length.
-        # Benchmarking mode: the user can specify the opt and max profile lengths via flags; if no additional benchmarking flags are provided, the non-benchmarking defaults are used.
-        # Note that the max length should be set to GPT2's MAX_LENGTH for text generation.
- max_sequence_length = GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- max_output_length = GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- opt_input_seq_len = max_sequence_length // 2
- opt_output_seq_len = max_output_length // 2
-
- # benchmarking flags
- if benchmarking_args is not None:
- max_sequence_length = benchmarking_args.input_profile_max_len
- max_output_length = benchmarking_args.output_profile_max_len
- opt_input_seq_len = benchmarking_args.input_seq_len
- opt_output_seq_len = benchmarking_args.output_seq_len
-
- if not hf_config.use_cache:
- # If not using kv cache, only input_ids is passed
- decoder_profiles = [Profile().add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, opt_output_seq_len),
- max=(batch_size * num_beams, max_output_length),
- )]
- else:
- num_heads = hf_config.n_head
- embedding_size_per_head = hf_config.n_embd // num_heads
- num_layers = hf_config.n_layer
-
- # context phase uses the provided input_ids to generate hidden states and self attention kv cache
- # It is only used in the 1st decoder run.
- dec_profiles_context = Profile().add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, opt_output_seq_len),
- max=(batch_size * num_beams, max_output_length),
- )
- self_attention_profile_context = {
- "min": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- }
-
- # generation phase uses previous self attention kv cache with the last input_ids token to generate the next hidden states and self attention kv cache
- # This optimization profile is used after the 1st decoder run.
- dec_profiles_generation = Profile().add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, 1),
- max=(batch_size * num_beams, 1),
- )
-
- self_attention_profile_generation = {
- "min": (batch_size * num_beams, num_heads, 1, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_output_seq_len - 1, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_output_length - 1, embedding_size_per_head),
- }
-
- for i in range(num_layers):
- dec_profiles_context = dec_profiles_context.add(
- f"past_key_values.{i}.decoder.key",
- **self_attention_profile_context
- ).add(
- f"past_key_values.{i}.decoder.value",
- **self_attention_profile_context
- )
-
- dec_profiles_generation = dec_profiles_generation.add(
- f"past_key_values.{i}.decoder.key",
- **self_attention_profile_generation
- ).add(
- f"past_key_values.{i}.decoder.value",
- **self_attention_profile_generation
- )
-
-            # TensorRT accepts multiple optimization profiles for the same engine.
-            # Profile 1 is only used in the first decoder iteration (context phase).
- decoder_profiles = [dec_profiles_generation, dec_profiles_context]
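The comments above describe the two-profile scheme; at runtime it maps onto one TensorRT execution context per profile, roughly as in the sketch below (an illustrative helper only, assuming `engine` is an `ICudaEngine` built with the two profiles in this order; binding management is omitted):

```python
import tensorrt as trt

def make_decoder_contexts(engine):
    """Create one execution context per optimization profile (sketch only)."""
    generation_ctx = engine.create_execution_context()    # a new context defaults to profile 0 (generation)
    context_phase_ctx = engine.create_execution_context()
    context_phase_ctx.active_optimization_profile = 1     # profile 1 handles the first (context) decoder run
    return generation_ctx, context_phase_ctx
```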
-
- # Convert ONNX models to TRT engines.
- if benchmarking_args is None:
- engine_tag = "bs{}".format(batch_size)
-        # When the user does not provide any profile_max_len, use the sequence lengths as the tag; both max lengths fall back to the config max.
- elif seq_tag:
- engine_tag = "bs{}-inseq{}-outseq{}".format(batch_size, benchmarking_args.input_seq_len, benchmarking_args.output_seq_len)
-        # When the user provides profile_max_len, the engine can be reused later with different seq_len values.
- else:
- engine_tag = "bs{}-inmax{}-outmax{}".format(batch_size, benchmarking_args.input_profile_max_len, benchmarking_args.output_profile_max_len)
-
- if num_beams > 1:
- engine_tag += "-beam{}".format(num_beams)
-
- preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
- if disable_preview_dynamic_shapes:
- engine_tag += "-noPreviewFasterDynamicShapes"
- else:
- preview_features.append(PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)
-
- self.gpt2_trt_engine = GPT2ONNXFile(
- decoder_onnx_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles,
- preview_features=preview_features
- )
- self.gpt2_trt = GPT2TRTDecoder(
- self.gpt2_trt_engine, metadata, hf_config, batch_size=batch_size, num_beams=num_beams, benchmarking_args = benchmarking_args
- )
-
- def run_trt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_trt_engine: bool,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- disable_preview_dynamic_shapes: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
-
- workspace = self._setup_workspace(metadata, working_directory)
-
- # no fpath provided for onnx files, download them
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self._download_models(workspace, metadata)
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
- hash_onnx_fpath = {v.name: v for v in onnx_fpaths}
-
- inference_results = []
- ppl_results = []
- try:
- if not benchmarking_mode:
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes)
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, hash_onnx_fpath, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- # reset the decoder
- self.gpt2_trt.reset()
-
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Inputs must pair up"
- if metadata.other.kv_cache or (args.num_beams > 1):
- G_LOGGER.warning("Skipping perplexity calculation for TRT with KV cache or beam search because it is not supported yet.")
- else:
- for r in perplexity_reference:
- ppl_results.append(
- self.execute_calculate_perplexity(metadata, r, batch_size)
- )
- else:
- hf_config = AutoConfig.from_pretrained(metadata.variant, use_cache = metadata.other.kv_cache)
-                # Check that input_seq_len and output_seq_len are valid and within the required range.
- max_input_seq_len = hf_config.n_positions
- max_output_seq_len = hf_config.n_positions
-
- seq_tag = args.input_profile_max_len is None and args.output_profile_max_len is None
-                # The user must provide either a pair of [input/output]_profile_max_len or a pair of [input/output]_seq_len.
- if args.input_profile_max_len is None or args.output_profile_max_len is None:
- if args.input_seq_len is None or args.output_seq_len is None:
- assert False, "Please provide at least one pair of inputs: [input/output]_seq_len or [input/output]_profile_max_len"
-
- input_profile_max_len = setup_benchmark_arg(args.input_profile_max_len, "input_profile_max_len", max_input_seq_len)
- output_profile_max_len = setup_benchmark_arg(args.output_profile_max_len, "output_profile_max_len", max_output_seq_len)
- input_seq_len = setup_benchmark_arg(args.input_seq_len, "input_seq_len", input_profile_max_len // 2)
- output_seq_len = setup_benchmark_arg(args.output_seq_len, "output_seq_len", output_profile_max_len // 2)
-
- benchmarking_args = GPT2TRTBenchmarkingArgs(input_seq_len, output_seq_len, input_profile_max_len, output_profile_max_len)
-
- # Assert to ensure the validity of benchmarking arguments
- assert benchmarking_args.input_seq_len <= benchmarking_args.input_profile_max_len, "input_seq_len should <= input_profile_max_len = {} for benchmarking mode".format(benchmarking_args.input_profile_max_len)
- assert benchmarking_args.output_seq_len <= benchmarking_args.output_profile_max_len, "output_seq_len should <= output_profile_max_len = {} for benchmarking mode".format(benchmarking_args.output_profile_max_len)
- assert benchmarking_args.input_profile_max_len <= max_input_seq_len, "Model config restrict input_profile_max_len <= {} for benchmark mode".format(max_input_seq_len)
- assert benchmarking_args.output_profile_max_len <= max_output_seq_len, "Model config restrict output_profile_max_len <= {} for benchmark mode".format(max_output_seq_len)
-                # GPT2 is a text generation model, so output_seq_len must be at least input_seq_len.
-                assert benchmarking_args.input_seq_len <= benchmarking_args.output_seq_len, "GPT2 model text generation requires output_seq_len >= input_seq_len."
-                assert benchmarking_args.input_profile_max_len <= benchmarking_args.output_profile_max_len, "GPT2 model text generation requires output_profile_max_len >= input_profile_max_len."
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes, benchmarking_args, seq_tag)
- inference_results = self.execute_inference(
- metadata, hash_onnx_fpath, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_trt_engine, keep_onnx_model, keep_torch_model)
-
- return inference_results, ppl_results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
-
- # use the same args as frameworks.py
- self.frameworks_cmd.add_args(parser)
- polygraphy_group = parser.add_argument_group("polygraphy")
- polygraphy_group.add_argument(
- "--onnx-fpath",
- default=None,
- help="Path to GPT2 ONNX model. If None is supplied, scripts will generate them from HuggingFace.",
- )
- polygraphy_group.add_argument(
- "--fp16", action="store_true", help="Enables fp16 TensorRT tactics."
- )
- polygraphy_group.add_argument(
- "--save-trt-engine",
- action="store_true",
- help="Saves TensorRT runtime engine in working directory.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
- gpt2_fpath_check = args.onnx_fpath is None
-
- network_models = None
- if gpt2_fpath_check:
- network_models = tuple()
- else:
- onnx_decoder = NetworkModel(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_fpath,
- )
-            network_models = (onnx_decoder,)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = GPT2TRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/NNDF/README.md b/demo/HuggingFace/NNDF/README.md
deleted file mode 100644
index 8a0cc98c..00000000
--- a/demo/HuggingFace/NNDF/README.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Neural Network Driven Framework
-
-NNDF is a collection of files and formats that provides an underlying policy and flow for TensorRT network onboarders to follow.
-NNDF is inspired by HuggingFace and PyTorch common design architectures where the Neural Network is divided into two abstractions:
-
-* High level abstractions via configuration files
-* Low level abstractions via I/O classes
-
-## Benefits
-
-Because NNDF is inspired by existing successful network frameworks, interoperating with HuggingFace, Torch, and other
-frameworks becomes trivial and code can often be reused. See for example `GenerationMixin`, which HuggingFace uses to
-implement `greedy_decoder` and `beam_search`. Using NNDF, we can call `beam_search` and other search functions directly.
-
-In other words:
-
-* Re-use high level measurement tools supplied by well known frameworks
-* Ensure a fair platform for timing TRT performance alongside other frameworks by using the same post-processing code.
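As a rough illustration of the pattern described above (not demo code; `TRTDecoderStub` and `run_trt_step` are placeholder names), a TRT-backed decoder only has to look like a HuggingFace causal LM for `generate()` and its search functions to drive it:

```python
import torch
from transformers.modeling_outputs import CausalLMOutputWithPast

def run_trt_step(input_ids: torch.Tensor) -> torch.Tensor:
    # Placeholder for a TensorRT execute call; returns random logits of the right shape.
    vocab_size = 50257
    return torch.randn(input_ids.shape[0], input_ids.shape[1], vocab_size)

class TRTDecoderStub:
    def forward(self, input_ids: torch.Tensor, **kwargs) -> CausalLMOutputWithPast:
        # Returning the standard HuggingFace output type is what lets beam_search & co. be reused.
        return CausalLMOutputWithPast(logits=run_trt_step(input_ids))
```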
diff --git a/demo/HuggingFace/NNDF/checkpoints.py b/demo/HuggingFace/NNDF/checkpoints.py
deleted file mode 100644
index 3c94ea32..00000000
--- a/demo/HuggingFace/NNDF/checkpoints.py
+++ /dev/null
@@ -1,155 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Helper file for generating common checkpoints.
-"""
-
-import itertools
-from typing import List
-
-# TRT-HuggingFace
-from NNDF.networks import NetworkMetadata, NetworkResult
-from NNDF.interface import VALID_FRAMEWORKS
-
-# externals
-import toml
-class NNTomlCheckpoint:
- """
- Loads a toml checkpoint file for comparing labels and inputs.
- The following nested key structure is required:
-
- [Network.Framework.Variant.Precision]
-
-    For each category, you can assign a default behaviour using a special key
- defined by CHECKPOINT_STRUCTURE_FLAT.
-
-    The default keys defined by CHECKPOINT_STRUCTURE_FLAT are reserved and must not collide with real framework, variant, or precision values.
- """
-
- # The checkpoint structure and their default keys
- CHECKPOINT_STRUCTURE_FLAT = {
- "framework": "all",
- "variant": "default",
- "precision": "all"
- }
-
- def __init__(self, fpath: str, framework: str, network_name: str, metadata: NetworkMetadata):
- """Loads the toml file for processing."""
- data = {}
- with open(fpath) as f:
- data = toml.load(f)
-
- assert framework in VALID_FRAMEWORKS
- # These keys are reserved to indicate the default state.
- assert self.CHECKPOINT_STRUCTURE_FLAT["framework"] not in VALID_FRAMEWORKS
-
- # Select the current input data
- # try to get the base data
- network_data = data.get(network_name, {})
-
- cur_keys = {
- "framework": framework,
- "variant": metadata.variant,
- "precision": "fp16" if metadata.precision.fp16 else "fp32"
- }
-
-        combined_keys = [[self.CHECKPOINT_STRUCTURE_FLAT[k], cur_keys[k]] for k in self.CHECKPOINT_STRUCTURE_FLAT.keys()]
- # A helper function for flattening the getters.
- def flat_getter(d=network_data, *args):
- for k in args:
- if k not in d:
- return {}
- d = d[k]
- return d
-
- # self.data stores several keys:
- # {"checkpoint_name": {"label": xxx, "input": xxx}}
-        # The loop below merges the more specific checkpoint entries on top of these defaults.
- self.data = network_data["all"]["default"]["all"]
- for keys in itertools.product(*combined_keys):
- values = flat_getter(network_data, *keys)
- if len(values) == 0:
- continue
- for data_k, data_v in self.data.items():
- if data_k in values:
- self.data[data_k] = {**data_v, **values[data_k]}
-
- # Used when accuracy() is called
- self._lookup_cache = None
-
- def _iterate_data(self, slice: List[str], skip_keyword: str = "skip"):
- """
- Helper for child classes to iterate through a slice of data.
-
- Return:
- (Union[Dict[str, str], List[str]]): Returns a list of all value keys given in 'slice' or if more than one value is given for 'slice' then a dictionary instead.
- """
- returns_dict = len(slice) > 1
- for value in self.data.values():
-            if skip_keyword in value:
- continue
-
- if returns_dict:
- yield {s: value[s] for s in slice}
- else:
- yield value[slice[0]]
-
-
-class NNSemanticCheckpoint(NNTomlCheckpoint):
- """Requires the following data structure:
-
- [...]
- [input_a]
- label = "sample_label"
- input = "sample_input"
-
- [input_b]
- label = "sample_label"
- input = "sample_input"
-
-    The following are reserved keywords:
-    framework = "all" indicates rules apply to all frameworks.
-    variant = "default" indicates rules apply to all variants.
-    precision = "all" indicates rules apply to all precisions.
- """
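For reference, a hedged sketch of what such a checkpoint might look like when parsed with the same `toml` package the loader uses (the network and variant names below are placeholders, not necessarily the exact keys used by the demo):

```python
import toml

snippet = """
[GPT2.all.default.all.input_a]
label = "sample_label"
input = "sample_input"

[GPT2.all.default.all.input_b]
label = "sample_label"
input = "sample_input"
"""

data = toml.loads(snippet)
# Nesting follows [Network.Framework.Variant.Precision.checkpoint_name].
print(data["GPT2"]["all"]["default"]["all"]["input_a"]["input"])  # -> "sample_input"
```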
-
- def __iter__(self):
- return self._iterate_data(["label", "input"])
-
- def labels(self):
- return self._iterate_data(["label"])
-
- def inputs(self):
- return self._iterate_data(["input"])
-
- def accuracy(self, results: List[NetworkResult]) -> float:
- # Hash checkpoints by their input
- if self._lookup_cache is None:
- self._lookup_cache = {}
- for k, v in self.data.items():
- self._lookup_cache[v["input"]] = k
-
- correct_count = 0
- for r in results:
-            # Find the data that corresponds to the input
- key = self._lookup_cache[r.input]
- # remove new line characters
- r_new = r.semantic_output[0] if isinstance(r.semantic_output, list) else r.semantic_output
- correct_count += int(self.data[key]["label"].replace('\\n','').replace('\n','') == r_new.replace('\\n','').replace('\n',''))
-
- return correct_count / len(results)
diff --git a/demo/HuggingFace/NNDF/cuda_bootstrapper.py b/demo/HuggingFace/NNDF/cuda_bootstrapper.py
deleted file mode 100644
index e9fdb26d..00000000
--- a/demo/HuggingFace/NNDF/cuda_bootstrapper.py
+++ /dev/null
@@ -1,101 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Holds logic for modifying and removing invalid CUDA libraries in LD_LIBRARY_PATH.
-
-Users may have CUDA libraries in LD_LIBRARY_PATH which can cause issues with Torch's cuBLAS.
-This problem only occurs on Linux.
-See:
- https://github.com/pytorch/pytorch/issues/94294
- https://github.com/pytorch/pytorch/issues/64097
-"""
-
-import os
-import sys
-import glob
-import shutil
-
-import subprocess as sp
-from NNDF.logger import G_LOGGER
-
-def bootstrap_ld_library_path() -> bool:
- """
- Modifies the LD_LIBRARY_PATH if applicable and then spawns a child process
- using first "poetry" and then "python3"/"python" if "poetry" fails.
- """
- if os.environ.get("TRT_OSS_DISABLE_BOOTSTRAP") or "linux" not in sys.platform:
- return False
-
- # Walk through each path in environment to see if there are cublas libraries being loaded.
- paths = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
- new_paths = []
- modified_path = False
- for path in paths:
- for lib in ("cublas", "cudart", "cublasLt"):
- g = glob.glob(os.path.join(path, f"lib{lib}.so.*"))
- if g:
- modified_path = True
- G_LOGGER.warning(f"Discarding `{path}` from LD_LIBRARY_PATH since it contains CUDA libraries.")
- break
- else:
- new_paths.append(path)
-
-
- if not modified_path:
- return False
- else:
- warning_msg = ("Attempting to bootstrap altered LD_LIBRARY_PATH. "
- "\nYou can disable this with TRT_OSS_DISABLE_BOOTSTRAP=1 however frameworks performance may be impacted. "
-            "\nThere are known issues with cuBLAS loading and PyTorch compatibility "
-            "that are still being resolved for most CUDA <= 12.1 and Torch setups. See: "
- "\n - https://github.com/pytorch/pytorch/issues/94294"
- "\n - https://github.com/pytorch/pytorch/issues/64097\n")
- G_LOGGER.warning(warning_msg)
-
- G_LOGGER.info(f"CUDA detected in path. Restarting scripts with modified LD_LIBRARY_PATH: {new_paths}")
- os.environ["LD_LIBRARY_PATH"] = os.pathsep.join(new_paths)
-    # Set TRT_OSS_DISABLE_BOOTSTRAP so the spawned child process does not attempt to bootstrap again.
- os.environ["TRT_OSS_DISABLE_BOOTSTRAP"] = "1"
-
- # Spawn a new child process instead.
- try:
- # Use the same python exe that invoked this script
- default_python = sys.executable
-
- # Demo supports both poetry and python3 invocation.
- # Check if poetry works first.
- cmd = [default_python] + list(sys.argv)
- if shutil.which("poetry") is not None:
- poetry_cmd = ["poetry", "run"] + cmd
-
-            # The poetry command is tried first. If it fails, we ignore the error and fall back to the default python.
- try:
- # Instantiate a secondary child process.
- sp.check_call(" ".join(poetry_cmd), env=dict(os.environ), cwd=os.getcwd(), shell=True)
- return True
- except:
- pass
-
- # Default python fallback.
- sp.check_call(" ".join(cmd), env=dict(os.environ), cwd=os.getcwd(), shell=True)
- except Exception as e:
- G_LOGGER.error("Unable to start a new process with modified LD_LIBRARY_PATH. Consider removing CUDA lib in LD_LIBRARY_PATH manually.")
- G_LOGGER.error(str(e))
- G_LOGGER.warning("Attempting to continue with demo.")
-
- return True
diff --git a/demo/HuggingFace/NNDF/general_utils.py b/demo/HuggingFace/NNDF/general_utils.py
deleted file mode 100644
index f8fb9897..00000000
--- a/demo/HuggingFace/NNDF/general_utils.py
+++ /dev/null
@@ -1,286 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""Common utils used by demo folder.
-Note:
-- For now, users/developers that are contributing to TensorRT OSS should NOT import non-default Python packages in this file, because the test pipeline's boot-up process cannot load extra dependencies. In the near future, alternative solutions such as creating a separate boot-up util list may be possible.
-- Users/developers that are just using the TensorRT OSS without contributing are still free to modify this file and customize for deployment.
-"""
-
-import os
-import shutil
-import timeit
-import math
-
-from datetime import datetime
-from shutil import rmtree
-from typing import Callable, Union, List
-from collections import defaultdict
-from statistics import mean, median
-from glob import glob
-
-# NNDF
-from NNDF.networks import NNConfig, NetworkResult, NetworkMetadata, TimingProfile
-from NNDF.logger import G_LOGGER
-
-# Used for HuggingFace setting random seed
-RANDOM_SEED = 42
-
-# Networks #
-def register_network_folders(
- root_dir: str, config_file_str: str = "*Config.py"
-) -> List[str]:
- networks = []
- for network_configs in glob(os.path.join(root_dir, "*", config_file_str)):
- network_name = os.path.split(os.path.split(network_configs)[0])[1]
- networks.append(network_name)
- return networks
-
-
-def process_results(category: List[str], results: List[NetworkResult], nconfig: NNConfig):
- """
- Calculate and process results across multiple runs.
- """
- general_stats = ["script", "accuracy"]
- runtime_result_row_names = list(nconfig.NETWORK_SEGMENTS)
- if nconfig.NETWORK_FULL_NAME not in nconfig.NETWORK_SEGMENTS:
- runtime_result_row_names.append(nconfig.NETWORK_FULL_NAME)
-
- rows = []
- row_entry = []
- for cat, result in zip(category, results):
- # Process runtime results for each group
- runtime_results = defaultdict(list)
- for runtimes in [nr.median_runtime for nr in result.network_results]:
- for runtime in runtimes:
- runtime_results[runtime.name].append(runtime.runtime)
-
- # Calculate average runtime for each group
- average_group_runtime = {k: mean(v) for k, v in runtime_results.items()}
- row_entry = [cat, result.accuracy] + [
- average_group_runtime[n] for n in runtime_result_row_names
- ]
- rows.append(row_entry)
-
- headers = general_stats + [r + " (sec)" for r in runtime_result_row_names]
- return headers, rows
-
-def process_per_result_entries(script_category: List[str], results: List[NetworkResult], max_output_char:int = 30):
- """Prints tabulations for each entry returned by the runtime result."""
- def _shorten_text(w):
- l = len(w)
- if l > max_output_char:
- return w[0:max_output_char // 2] + " ... " + w[-max_output_char//2:]
- return w
-
- headers = ["script", "network_part", "accuracy", "runtime", "input", "output"]
- row_data_by_input = defaultdict(list)
- for cat, result in zip(script_category, results):
- for nr in result.network_results:
- for runtime in nr.median_runtime:
- row_data_by_input[hash(nr.input)].append([
- cat,
- runtime.name,
- result.accuracy,
- runtime.runtime,
- _shorten_text(nr.input),
- _shorten_text(nr.semantic_output)
- ])
-
- return headers, dict(row_data_by_input)
-
-# IO #
-def confirm_folder_delete(
- fpath: str, prompt: str = "Confirm you want to delete entire folder?"
-) -> None:
- """
- Confirms whether or not user wants to delete given folder path.
-
- Args:
- fpath (str): Path to folder.
- prompt (str): Prompt to display
-
- Returns:
- None
- """
- msg = prompt + " {} [Y/n] ".format(fpath)
- confirm = input(msg)
- if confirm == "Y":
- rmtree(fpath)
- else:
- G_LOGGER.info("Skipping file removal.")
-
-
-def remove_if_empty(
- fpath: str,
- success_msg: str = "Folder successfully removed.",
- error_msg: str = "Folder cannot be removed, there are files.",
-) -> None:
- """
- Removes an entire folder if folder is empty. Provides print info statements.
-
- Args:
- fpath: Location to folder
- success_msg: Success message.
- error_msg: Error message.
-
- Returns:
- None
- """
- if len(os.listdir(fpath)) == 0:
- os.rmdir(fpath)
- G_LOGGER.info(success_msg + " {}".format(fpath))
- else:
- G_LOGGER.info(error_msg + " {}".format(fpath))
-
-
-def measure_python_inference_code(
- stmt: Union[Callable, str], timing_profile: TimingProfile
-) -> None:
- """
- Measures the time it takes to run Pythonic inference code.
- Statement given should be the actual model inference like forward() in torch.
-
- Args:
- stmt (Union[Callable, str]): Callable or string for generating numbers.
- timing_profile (TimingProfile): The timing profile settings with the following fields.
- warmup (int): Number of iterations to run as warm-up before actual measurement cycles.
- number (int): Number of times to call function per iteration.
- iterations (int): Number of measurement cycles.
- duration (float): Minimal duration for measurement cycles.
- percentile (int or list of ints): key percentile number(s) for measurement.
- """
-
- def simple_percentile(data, p):
- """
- Temporary replacement for numpy.percentile() because TRT CI/CD pipeline requires additional packages to be added at boot up in this general_utils.py file.
- """
-        assert p >= 0 and p <= 100, "Percentile must be between 0 and 100"
-
- rank = len(data) * p / 100
- if rank.is_integer():
- return sorted(data)[int(rank)]
- else:
- return sorted(data)[int(math.ceil(rank)) - 1]
-
- warmup = timing_profile.warmup
- number = timing_profile.number
- iterations = timing_profile.iterations
- duration = timing_profile.duration
- percentile = timing_profile.percentile
-
- G_LOGGER.debug(
- "Measuring inference call with warmup: {} and number: {} and iterations {} and duration {} secs".format(
- warmup, number, iterations, duration
- )
- )
- # Warmup
- warmup_mintime = timeit.repeat(stmt, number=number, repeat=warmup)
- G_LOGGER.debug("Warmup times: {}".format(warmup_mintime))
-
- # Actual measurement cycles
- results = []
- start_time = datetime.now()
- iter_idx = 0
- while iter_idx < iterations or (datetime.now() - start_time).total_seconds() < duration:
- iter_idx += 1
- results.append(timeit.timeit(stmt, number=number))
-
- if isinstance(percentile, int):
- return simple_percentile(results, percentile) / number
- else:
- return [simple_percentile(results, p) / number for p in percentile]
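A hedged usage sketch of the function above (the callable and the profile values are placeholders; the `TimingProfile` fields match those documented in the docstring):

```python
from NNDF.networks import TimingProfile

def dummy_inference():
    # Stand-in for an actual model forward() call.
    sum(i * i for i in range(10_000))

profile = TimingProfile(iterations=10, number=1, warmup=3, duration=0.0, percentile=50)
median_seconds = measure_python_inference_code(dummy_inference, profile)
```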
-
-class NNFolderWorkspace:
- """
- For keeping track of workspace folder and for cleaning them up.
- Due to potential corruption of ONNX model conversion, the workspace is split up by model variants.
- """
-
- def __init__(
- self, network_name: str, metadata: NetworkMetadata, working_directory: str
- ):
- self.rootdir = working_directory
- self.metadata = metadata
- self.network_name = network_name
- self.dpath = os.path.join(self.rootdir, self.network_name, metadata.variant)
- os.makedirs(self.dpath, exist_ok=True)
-
- def set_model_path(self, metadata_serialized, is_encoder_decoder: bool) -> str:
- '''
-        Create a subdirectory for models with different configs (e.g. KV cache)
- '''
- self.model_path = os.path.join(self.dpath, metadata_serialized)
- self.decoder_path = os.path.join(self.model_path, "decoder")
- os.makedirs(self.decoder_path, exist_ok=True)
- if is_encoder_decoder:
- self.encoder_path = os.path.join(self.model_path, "encoder")
- os.makedirs(self.encoder_path, exist_ok=True)
- # For decoder only models, there is no encoder
- else:
- self.encoder_path = None
-
-        # In KV-cache mode, the decoder needs separate non-KV and KV subdirectories.
- if self.metadata.other.kv_cache:
- self.decoder_non_kv_path = os.path.join(self.decoder_path, "non-kv")
- self.decoder_kv_path = os.path.join(self.decoder_path, "kv")
- os.makedirs(self.decoder_non_kv_path, exist_ok=True)
- os.makedirs(self.decoder_kv_path, exist_ok=True)
-
- return self.model_path, self.encoder_path, self.decoder_path
-
- def get_path(self) -> str:
- return self.dpath
-
- def get_model_path(self) -> str:
- return self.model_path
-
- def get_encoder_path(self) -> str:
- return self.encoder_path
-
- def get_decoder_path(self) -> str:
- return self.decoder_path
-
- def get_decoder_path_kv(self) -> (str, str):
- if not self.metadata.other.kv_cache:
- raise RuntimeError("Trying to access kv specific folder in non kv mode")
- else:
- return self.decoder_kv_path, self.decoder_non_kv_path
-
- def cleanup(self, force_remove: bool = False) -> None:
- '''
- Cleanup would remove all the contents in the workspace.
- '''
- if force_remove:
- return shutil.rmtree(self.dpath)
-
-        if hasattr(self, "model_path"):
- if self.encoder_path is not None:
- remove_if_empty(self.encoder_path)
- if self.metadata.other.kv_cache:
- remove_if_empty(
- self.decoder_kv_path
- )
- remove_if_empty(
- self.decoder_non_kv_path
- )
- remove_if_empty(
- self.decoder_path
- )
-
- remove_if_empty(self.model_path)
- remove_if_empty(self.dpath)
diff --git a/demo/HuggingFace/NNDF/interface.py b/demo/HuggingFace/NNDF/interface.py
deleted file mode 100644
index 8d1a739c..00000000
--- a/demo/HuggingFace/NNDF/interface.py
+++ /dev/null
@@ -1,531 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Interface classes required for each registered network script.
-"""
-
-import argparse
-
-from abc import ABCMeta, abstractmethod
-from typing import List, Tuple, Union
-
-# NNDF
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkResult,
- NetworkMetadata,
- NetworkCheckpointResult,
- NNConfig,
- NetworkModel,
- TimingProfile,
-)
-from NNDF.logger import G_LOGGER
-from NNDF.general_utils import NNFolderWorkspace
-
-# externals
-# None, there should be no external dependencies for testing purposes.
-
-# Program-wide constants for passing in valid frameworks.
-FRAMEWORK_NATIVE = "native"
-FRAMEWORK_TENSORRT = "trt"
-FRAMEWORK_ONNXRT = "onnxrt"
-VALID_FRAMEWORKS = [
- FRAMEWORK_NATIVE,
- FRAMEWORK_ONNXRT,
- FRAMEWORK_TENSORRT
-]
-
-class MetadataArgparseInteropMixin:
- """Add argparse support where the class can add new arguments to an argparse object."""
-
- @staticmethod
- @abstractmethod
- def add_args(parser):
- pass
-
- @staticmethod
- @abstractmethod
- def from_args(args):
- pass
-
- @staticmethod
- @abstractmethod
- def add_inference_args(parser):
- pass
-
- @staticmethod
- @abstractmethod
- def from_inference_args(args):
- pass
-
- @staticmethod
- @abstractmethod
- def add_benchmarking_args(parser):
- """
- Add args needed for perf benchmarking mode.
- """
- pass
-
-class NetworkCommand(metaclass=ABCMeta):
- """Base class that each network script's command module should inherit."""
-
- description = "NetworkCommand"
-
- DEFAULT_ITERATIONS = 10
- DEFAULT_NUMBER = 1
- DEFAULT_WARMUP = 3
- DEFAULT_DURATION = 0.0
- DEFAULT_PERCENTILE = 50
-
- def __init__(self, network_config: NNConfig, description: str):
- self.config = network_config()
- self.description = description
- self.framework_name = None
- self._parser = argparse.ArgumentParser(description=description, conflict_handler="resolve")
-
- def __call__(self):
- self.add_args(self._parser)
- self.config.MetadataClass.add_args(self._parser)
- self._args = self._parser.parse_args()
-
- if self._args.verbose:
- G_LOGGER.setLevel(level=G_LOGGER.DEBUG)
- elif self._args.info:
- G_LOGGER.setLevel(level=G_LOGGER.INFO)
-
- self.metadata = self.args_to_network_metadata(self._args)
- self.check_network_metadata_is_supported(self.metadata)
-
- @abstractmethod
- def run_benchmark(self):
- """
- Run inference in performance benchmarking mode for apples-to-apples perf comparisons across platforms.
- Differences with normal run mode include (but are not limited to):
-
- - Use random input data and disable accuracy checking.
- - Use fixed input/output sequence lengths and disable early stopping.
- - Provide better controls on the number of warm-ups and the number/duration of inference iterations.
-
- The derived class should override this method for the benchmarking implementation for the specific framework.
- """
- pass
-
- def add_args(self, parser) -> None:
- general_group = parser.add_argument_group("general")
- general_group.add_argument(
- "--verbose", help="Display verbose logs.", action="store_true"
- )
- general_group.add_argument(
- "--info", help="Display info logs.", action="store_true"
- )
- general_group.add_argument(
- "--cleanup",
-            help="Cleans up the user-specified workspace. The workspace cannot be cleaned if external files exist in it.",
- action="store_false",
- )
- general_group.add_argument(
- "--working-dir",
- help="Location of where to save the model and other downloaded files.",
- required=True,
- )
- general_group.add_argument(
- "--batch-size", "-b",
- help="Chosen batch size for given network",
- required=False,
- type=int,
- default=1
- )
-
- timing_group = parser.add_argument_group("inference measurement")
- timing_group.add_argument(
- "--iterations",
- type=int,
- help="Number of iterations to measure.",
- default=self.DEFAULT_ITERATIONS,
- )
- timing_group.add_argument(
- "--number",
- type=int,
-            help="Number of actual inference cycles per iteration.",
- default=self.DEFAULT_NUMBER,
- )
- timing_group.add_argument(
- "--warmup",
- type=int,
- help="Number of warmup iterations before actual measurement occurs.",
- default=self.DEFAULT_WARMUP,
- )
- timing_group.add_argument(
- "--duration",
- type=float,
- help="Minimal duration of inference iterations to measure.",
- default=self.DEFAULT_DURATION,
- )
- timing_group.add_argument(
- "--percentile",
- type=int,
- help="Key percentile number for time measurement.",
- default=self.DEFAULT_PERCENTILE,
- )
-
- def check_network_metadata_is_supported(self, metadata: NetworkMetadata) -> None:
- """
- Checks if current command supports the given metadata as defined by the NNConfig.
- Args:
- metadata (NetworkMetadata): NetworkMetadata to check if input is supported.
-
- Throws:
- NotImplementedError: If the given metadata is not a valid configuration for this network.
-
- Returns:
- None
- """
- if metadata not in self.config.variants:
- raise NotImplementedError(
- "The following network config is not yet supported by our scripts: {}".format(
- metadata
- )
- )
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- return self.config.MetadataClass.from_args(args)
-
- def load_nn_semantic_checkpoint(self) -> object:
- """
- Loads the NNSemanticCheckpoint instance from checkpoint.toml file.
- """
-        # Defer the import so that this interface file can be used without
-        # installing dependencies during our testing.
- from NNDF.checkpoints import NNSemanticCheckpoint
- checkpoint = NNSemanticCheckpoint(
- "checkpoint.toml",
- framework=self.framework_name,
- network_name=self.config.network_name,
- metadata=self.metadata,
- )
- return checkpoint
-
- def get_timing_profile(self) -> TimingProfile:
- """
- Get TimingProfile settings given current args.
- """
- return TimingProfile(
- iterations=int(self._args.iterations),
- number=int(self._args.number),
- warmup=int(self._args.warmup),
-            duration=float(self._args.duration),
- percentile=int(self._args.percentile),
- )
-
-
-class FrameworkCommand(NetworkCommand):
- """Base class that is associated with Frameworks related scripts."""
-
- def __init__(self, network_config: NNConfig, description: str):
- super().__init__(network_config, description)
- self.framework_name = FRAMEWORK_NATIVE
-
- @abstractmethod
- def run_framework(
- self,
- metadata: NetworkMetadata,
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_pytorch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int,
- args: object = None,
- benchmarking_mode: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- pass
-
- def __call__(self):
- super().__call__()
-
- checkpoint = self.load_nn_semantic_checkpoint()
-
- network_results, ppl_results = self.run_framework(
- metadata=self.metadata,
- network_input=list(checkpoint.inputs()),
- working_directory=self._args.working_dir,
- keep_onnx_model=self._args.cleanup,
- keep_pytorch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- use_cpu=self._args.cpu,
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=False,
- perplexity_reference=list(checkpoint.labels()),
- )
-
- return NetworkCheckpointResult(
- network_results=network_results,
- accuracy=checkpoint.accuracy(network_results),
- perplexity=(sum(ppl_results) / len(ppl_results) if ppl_results else None),
- )
-
- def run_benchmark(self):
- self.config.MetadataClass.add_benchmarking_args(self._parser)
- super().__call__()
-
- network_results = self.run_framework(
- metadata=self.metadata,
- network_input=None,
- working_directory=self._args.working_dir,
- keep_onnx_model=self._args.cleanup,
- keep_pytorch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- use_cpu=self._args.cpu,
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=True,
- )
-
- return network_results
-
- def add_args(self, parser) -> argparse.ArgumentParser:
- super().add_args(parser)
- device_group = parser.add_argument_group("device")
- device_group.add_argument(
- "--cpu",
- help="Run inference using CPU for frameworks.",
- action="store_true",
- )
-
-class TRTInferenceCommand(NetworkCommand):
- """Base class that is associated with Polygraphy related scripts."""
-
- def __init__(
- self,
- network_config: NNConfig,
- description: str,
- frameworks_cmd: FrameworkCommand,
- ):
- super().__init__(network_config, description)
- self.framework_name = FRAMEWORK_TENSORRT
- # Should be set by
- self.frameworks_cmd = frameworks_cmd()
-
- def _setup_workspace(self, metadata: NetworkMetadata, working_directory: str) -> NNFolderWorkspace:
- return NNFolderWorkspace(
- self.frameworks_cmd.config.network_name, metadata, working_directory
- )
-
- def _download_models(
- self,
- workspace: NNFolderWorkspace,
- metadata: NetworkMetadata,
- ) -> Tuple[NetworkModel]:
- # No fpath provided for onnx files, download them from HuggingFace repo.
- return self.frameworks_cmd.generate_and_download_framework(
- metadata, workspace
- ).onnx
-
- @abstractmethod
- def run_trt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_trt_engine: bool,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- disable_preview_dynamic_shapes: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- pass
-
- def __call__(self):
- self.config.MetadataClass.add_inference_args(self._parser)
- super().__call__()
- onnx_fpaths = self.args_to_network_models(self._args)
-
- checkpoint = self.load_nn_semantic_checkpoint()
-
- network_results, ppl_results = self.run_trt(
- metadata=self.metadata,
- onnx_fpaths=onnx_fpaths,
- network_input=list(checkpoint.inputs()),
- working_directory=self._args.working_dir,
- keep_trt_engine=self._args.cleanup,
- keep_onnx_model=self._args.cleanup,
- keep_torch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=False,
- disable_preview_dynamic_shapes=self._args.disable_preview_dynamic_shapes,
- perplexity_reference=list(checkpoint.labels()),
- )
-
- return NetworkCheckpointResult(
- network_results=network_results,
- accuracy=checkpoint.accuracy(network_results),
- perplexity=(sum(ppl_results) / len(ppl_results) if ppl_results else None),
- )
-
- def run_benchmark(self):
- self.config.MetadataClass.add_inference_args(self._parser)
- self.config.MetadataClass.add_benchmarking_args(self._parser)
- super().__call__()
- onnx_fpaths = self.args_to_network_models(self._args)
-
- network_results = self.run_trt(
- metadata=self.metadata,
- onnx_fpaths=onnx_fpaths,
- network_input=None,
- working_directory=self._args.working_dir,
- keep_trt_engine=self._args.cleanup,
- keep_onnx_model=self._args.cleanup,
- keep_torch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=True,
- disable_preview_dynamic_shapes=self._args.disable_preview_dynamic_shapes
- )
-
- return network_results
-
- def add_args(self, parser) -> argparse.ArgumentParser:
- super().add_args(parser)
- trt_group = parser.add_argument_group("trt")
- trt_group.add_argument(
- "--disable-preview-dynamic-shapes",
- help="Disable the FASTER_DYNAMIC_SHAPES_0805 preview feature when building the TensorRT engine",
- action="store_true",
- )
-
- trt_benchmarking_group = parser.add_argument_group("trt benchmarking group")
- trt_benchmarking_group.add_argument(
- "--input-profile-max-len",
- type=int,
- help="Specify max input sequence length in TRT engine profile. (default: max supported sequence length)",
- )
- trt_benchmarking_group.add_argument(
- "--output-profile-max-len",
- type=int,
- help="Specify max output sequence length in TRT engine profile. (default: max supported sequence length)",
- )
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- return self.config.MetadataClass.from_inference_args(args)
-
- @abstractmethod
- def args_to_network_models(self, args) -> Tuple[NetworkModel]:
- """
- Converts argparse arguments into a list of valid NetworkModel fpaths. Specifically for ONNX.
-        Invokes conversion scripts if the models are not already provided.
- Return:
- List[NetworkModel]: List of network model names.
- """
-
-class OnnxRTCommand(NetworkCommand):
- """ONNX Runtime command."""
-
- def __init__(
- self,
- network_config: NNConfig,
- description: str,
- frameworks_cmd: FrameworkCommand,
- ):
- super().__init__(network_config, description)
- self.framework_name = FRAMEWORK_ONNXRT
- # Should be set by
- self.frameworks_cmd = frameworks_cmd()
-
- @abstractmethod
- def run_onnxrt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- args: object = None,
- benchmarking_mode: bool = False,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- pass
-
- def __call__(self):
- self.config.MetadataClass.add_inference_args(self._parser)
- super().__call__()
- onnx_fpaths = self.args_to_network_models(self._args)
-
- checkpoint = self.load_nn_semantic_checkpoint()
-
- network_results = self.run_onnxrt(
- metadata=self.metadata,
- onnx_fpaths=onnx_fpaths,
- network_input=list(checkpoint.inputs()),
- working_directory=self._args.working_dir,
- keep_onnx_model=self._args.cleanup,
- keep_torch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=False,
- )
-
- return NetworkCheckpointResult(
- network_results=network_results,
- accuracy=checkpoint.accuracy(network_results),
- perplexity=None,
- )
-
- def run_benchmark(self):
- self.config.MetadataClass.add_inference_args(self._parser)
- self.config.MetadataClass.add_benchmarking_args(self._parser)
- super().__call__()
- onnx_fpaths = self.args_to_network_models(self._args)
-
- network_results = self.run_onnxrt(
- metadata=self.metadata,
- onnx_fpaths=onnx_fpaths,
- network_input=None,
- working_directory=self._args.working_dir,
- keep_onnx_model=self._args.cleanup,
- keep_torch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=True,
- )
-
- return network_results
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- return self.config.MetadataClass.from_inference_args(args)
-
- @abstractmethod
- def args_to_network_models(self, args) -> Tuple[NetworkModel]:
- """
- Converts argparse arguments into a list of valid NetworkModel fpaths. Specifically for ONNX.
-        Invokes conversion scripts if the models are not already provided.
- Return:
- List[NetworkModel]: List of network model names.
- """
diff --git a/demo/HuggingFace/NNDF/models.py b/demo/HuggingFace/NNDF/models.py
deleted file mode 100644
index 8a51392b..00000000
--- a/demo/HuggingFace/NNDF/models.py
+++ /dev/null
@@ -1,514 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-File for containing model file abstraction. Useful for generating models.
-"""
-
-import os
-from abc import ABCMeta, abstractmethod
-from typing import Union, List
-from shutil import copytree, rmtree
-
-# polygraphy
-from polygraphy.backend.trt import (
- network_from_onnx_path,
- engine_from_network,
- save_engine,
- Profile,
-)
-
-from polygraphy.backend.trt import CreateConfig
-from polygraphy.logger import G_LOGGER as PG_LOGGER
-
-# torch
-from torch import load, save
-from torch.nn import Module
-
-# tensorrt
-from tensorrt import PreviewFeature, MemoryPoolType
-
-# TRT-HuggingFace
-from NNDF.networks import NetworkMetadata
-from NNDF.logger import G_LOGGER
-
-
-class ModelFileConverter:
- """Abstract class for converting one model format to another."""
-
- def __init__(self, onnx_class, torch_class, trt_engine_class):
- self.onnx_class = onnx_class
- self.torch_class = torch_class
- self.trt_engine_class = trt_engine_class
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Converts a torch.Model into an ONNX model on disk specified at output_fpath.
-
- Arg:
- output_fpath (str): File location of the generated ONNX file.
-            model (torch.nn.Module): The loaded torch model to export.
- network_metadata (NetworkMetadata): Network metadata of the network being converted.
-
- Returns:
- ONNXModelFile: Newly generated ONNXModelFile
- """
- raise NotImplementedError(
- "Current model does not support exporting to ONNX model."
- )
-
- def onnx_to_torch(
- self, output_fpath: str, input_fpath: str, network_metadata: NetworkMetadata
- ):
- """
- Converts ONNX file into torch.Model which is written to disk.
-
- Arg:
-            output_fpath (str): File location of the generated torch model.
-            input_fpath (str): File location of the ONNX file to convert.
- network_metadata (NetworkMetadata): Network metadata of the network being converted.
-
- Returns:
- TorchModelFile: Newly generated TorchModelFile
- """
- raise NotImplementedError(
- "Current model does not support exporting to torch model."
- )
-
- def onnx_to_trt(
- self,
- output_fpath: str,
- input_fpath: str,
- network_metadata: NetworkMetadata,
- profiles: List[Profile],
- preview_features: List[PreviewFeature],
- ):
- """
- Converts ONNX file to TRT engine.
- Since TensorRT already supplies converter functions and scripts,
- a default implementation is already provided.
-
- Arg:
-            output_fpath (str): File location of the generated TRT engine.
-            input_fpath (str): File location of the ONNX file to convert.
- network_metadata (NetworkMetadata): Network metadata of the network being converted.
- profiles (List[polygraphy.backend.trt.Profile]): The optimization profiles used to build the engine.
- preview_features (List[tensorrt.PreviewFeature]): The preview features to set when building the engine.
-
- Returns:
- TRTEngineFile: Newly generated engine.
- """
- result = self.trt_engine_class(output_fpath, network_metadata)
-
- G_LOGGER.info("Using optimization profiles: {:}".format(profiles))
-
- try:
- self.trt_inference_config = CreateConfig(
- tf32=True,
- fp16=network_metadata.precision.fp16,
- memory_pool_limits = {MemoryPoolType.WORKSPACE: result.max_trt_workspace * 1024 * 1024},
- profiles=profiles,
- precision_constraints=("obey" if result.use_obey_precision_constraints() else None),
- preview_features=preview_features
- )
- except TypeError as e:
- G_LOGGER.error(f"This demo may have an outdated polygraphy. Please see requirements.txt for more details.")
- raise e
-
- if G_LOGGER.level == G_LOGGER.DEBUG:
- g_logger_verbosity = PG_LOGGER.EXTRA_VERBOSE
- elif G_LOGGER.level == G_LOGGER.INFO:
- g_logger_verbosity = PG_LOGGER.INFO
- else:
- g_logger_verbosity = PG_LOGGER.WARNING
-
- with PG_LOGGER.verbosity(g_logger_verbosity):
- network_definition = result.get_network_definition(network_from_onnx_path(input_fpath))
-
- trt_engine = engine_from_network(
- network_definition, config=self.trt_inference_config
- )
- save_engine(trt_engine, output_fpath)
-
- return result
-
-
-class NNModelFile(metaclass=ABCMeta):
- """
- Model abstraction. Allows for loading model as various formats.
- The class assumes models live on the disk in order to reduce complexity of model loading into memory.
- The class guarantees that once export functions are called, models exist on the disk for other
- code to parse or use in other libraries.
- """
-
- def __init__(
- self,
- default_converter: ModelFileConverter = None,
- network_metadata: NetworkMetadata = None,
- ):
- """
- Args:
- default_converter (ModelFileConverter): Default converter used when none is supplied to the as_*() methods.
- network_metadata (NetworkMetadata): Network metadata of the network being wrapped.
- """
- if default_converter is not None:
- self.default_converter = default_converter()
- else:
- self.default_converter = NullConverter()
-
- self.network_metadata = network_metadata
-
- def as_torch_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Converts the current model into a torch model which is written to disk.
- Uses the provided converter, or the default_converter if none is given.
-
- Args:
- output_fpath (str): File location of the generated torch file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
-
- Returns:
- TorchModelFile: Newly generated TorchModelFile
- """
- raise NotImplementedError(
- "Current model does not support exporting to pytorch model."
- )
-
- def as_onnx_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Converts current model into an ONNX model.
- Uses the provided converter, or the default_converter if none is given.
-
- Args:
- output_fpath (str): File location of the generated ONNX file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
-
- Returns:
- ONNXModelFile: Newly generated ONNXModelFile
- """
- raise NotImplementedError(
- "Current model does not support exporting to onnx model."
- )
-
- def as_trt_engine(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- profiles: List[Profile] = [],
- preview_features: List[PreviewFeature] = []
- ):
- """
- Converts the current model into a TRT engine.
- Uses the provided converter, or the default_converter if none is given.
-
- Args:
- output_fpath (str): File location of the generated TRT engine.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- profiles (List[polygraphy.backend.trt.Profile]): The optimization profiles used to build the engine.
- preview_features (List[tensorrt.PreviewFeature]): The preview features to enable when building the engine.
-
- Returns:
- TRTEngineFile: Newly generated TRTEngineFile.
- """
- raise NotImplementedError(
- "Current model does not support exporting to trt engine."
- )
-
- @abstractmethod
- def cleanup(self) -> None:
- """Cleans up any saved models or loaded models from memory."""
-
-
-class TorchModelFile(NNModelFile):
- def __init__(
- self,
- model: Union[str, Module],
- default_converter: ModelFileConverter = None,
- network_metadata: NetworkMetadata = None,
- ):
- """
- Since torch functions often allow for models to either be from disk as fpath or from a loaded object,
- we provide a similar option here. Arguments can either be a path on disk or from model itself.
-
- Args:
- model (Union[str, torch.Model]): Location of the model as fpath OR loaded torch.Model object.
- """
- super().__init__(default_converter, network_metadata)
-
- if isinstance(model, Module):
- self.is_loaded = True
- self.fpath = None
- self.model = model
- else:
- self.is_loaded = False
- self.fpath = model
- self.model = None
-
- def load_model(self) -> Module:
- """
- Loads the model from disk if it isn't already loaded.
- Does not attempt to load if given model is already loaded and instead returns original instance.
- Use as_torch_model() instead to always guarantee a new instance and location on disk.
-
- Args:
- None
-
- Returns:
- torch.Model: Loaded torch model.
- """
- if self.is_loaded:
- return self.model
-
- return load(self.fpath)
-
- def as_onnx_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Converts the torch model into an onnx model.
-
- Args:
- output_fpath (str): File location of the generated ONNX file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
- Return:
- (converter.onnx_class): Returns a converted instance of ONNXModelFile.
- """
- converter = self.default_converter if converter is None else converter()
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.onnx_class(output_fpath, self.network_metadata)
-
- return converter.torch_to_onnx(
- output_fpath, self.load_model(), self.network_metadata
- )
-
- def as_torch_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Since the model is already a torch model, forces a save to the specified folder and returns a new TorchModelFile object from that location.
-
- Args:
- output_fpath (str): File location of the generated torch file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
- Return:
- (converter.torch_class): Returns a converted instance of TorchModelFile.
- """
- converter = self.default_converter if converter is None else converter()
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.torch_class(output_fpath, self.network_metadata)
-
- if self.is_loaded:
- save(self.model, output_fpath)
- else:
- copytree(self.fpath, output_fpath)
-
- return converter.torch_class(output_fpath, self.network_metadata)
-
- def cleanup(self) -> None:
- if self.model:
- G_LOGGER.debug("Freeing model from memory: {}".format(self.model))
- del self.model
-
- if self.fpath:
- G_LOGGER.debug("Removing saved torch model from location: {}".format(self.fpath))
- rmtree(self.fpath)
-
-
-class ONNXModelFile(NNModelFile):
- def __init__(
- self,
- model: str,
- default_converter: ModelFileConverter = None,
- network_metadata: NetworkMetadata = None,
- ):
- """
- Keeps track of ONNX model file. Does not support loading into memory. Only reads and writes to disk.
-
- Args:
- model (str): Location of the ONNX model on disk.
- """
- super().__init__(default_converter, network_metadata)
- self.fpath = model
-
- def as_onnx_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Since the model is already an ONNX model, forces a save to the specified folder and returns a new ONNXModelFile object from that location.
-
- Args:
- output_fpath (str): File location of the generated ONNX file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
-
- Return:
- (converter.onnx_class): Returns a converted instance of ONNXModelFile.
- """
- converter = self.default_converter if converter is None else converter()
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.onnx_class(output_fpath, self.network_metadata)
- else:
- copytree(self.fpath, output_fpath)
-
- return converter.onnx_class(output_fpath, self.network_metadata)
-
- def as_torch_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Converts the ONNX model into a torch model.
-
- Args:
- output_fpath (str): File location of the generated torch file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
- Return:
- (converter.torch_class): Returns a converted instance of TorchModelFile.
- """
- converter = self.default_converter if converter is None else converter()
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.torch_class(output_fpath, self.network_metadata)
-
- return converter.onnx_to_torch(output_fpath, self.fpath, self.network_metadata)
-
- def _cleanup_onnx_folder(self, folder_dir):
- for d in os.listdir(folder_dir):
- fpath = os.path.join(folder_dir, d)
- # Remove everything related to onnx other than engine
- if (os.path.isfile(fpath)) and (".engine" not in d):
- os.remove(fpath)
-
- def cleanup(self) -> None:
- G_LOGGER.debug("Removing saved ONNX model from location: {}".format(self.fpath))
- if (not self.network_metadata.other.kv_cache) or ("encoder" in self.fpath):
- # Clean up any onnx external files by removing integer named values and weight files
- workspace_path = os.path.split(self.fpath)[0]
- self._cleanup_onnx_folder(workspace_path)
-
- else:
- # In kv-cache mode, the decoder ONNX files are split into "kv" and "non-kv" subfolders, so clean up each one separately as a temporary WAR.
- decoder_path = os.path.split(self.fpath)[0]
- decoder_non_kv_path = os.path.join(decoder_path, "non-kv")
- decoder_kv_path = os.path.join(decoder_path, "kv")
- # Remove kv and nonkv folder correspondingly.
- self._cleanup_onnx_folder(decoder_non_kv_path)
- self._cleanup_onnx_folder(decoder_kv_path)
-
- def as_trt_engine(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- profiles = [],
- preview_features = []
- ):
- """
- Converts the ONNX model into a TRT engine.
-
- Args:
- output_fpath (str): File location of the generated TRT engine.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- profiles (List[polygraphy.backend.trt.Profile]): The optimization profiles used to build the engine.
- preview_features (List[tensorrt.PreviewFeature]): The preview features to set when building the engine.
- Return:
- (converter.trt_engine_class): Returns a converted instance of TRTEngineFile.
- """
- converter = self.default_converter if converter is None else converter()
-
- # TODO: Need to check if the old engine file is compatible with current setting
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.trt_engine_class(output_fpath, self.network_metadata)
-
- return converter.onnx_to_trt(
- output_fpath,
- self.fpath,
- self.network_metadata,
- profiles,
- preview_features
- )
-
-
-class TRTEngineFile(NNModelFile):
-
- @abstractmethod
- def use_obey_precision_constraints(self):
- pass
-
- # get_network_definition can be overloaded to alter the network definition.
- # For example, this function can be used to change the precisions of ops or
- # data type of intermediate tensors.
- def get_network_definition(self, network_definition):
- return network_definition
-
- def __init__(
- self,
- model: str,
- default_converter: ModelFileConverter = None,
- network_metadata: NetworkMetadata = None,
- ):
- super().__init__(default_converter, network_metadata)
- self.fpath = model
- self.max_trt_workspace = 3072
-
- def cleanup(self) -> None:
- G_LOGGER.debug("Removing saved engine model from location: {}".format(self.fpath))
- os.remove(self.fpath)
-
-
-class NullConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(ONNXModelFile, TorchModelFile, TRTEngineFile)
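For orientation, here is a minimal sketch of how these abstractions chain together, assuming the NNDF package is importable. The converter and engine-file classes and all paths below are illustrative only; the real converters live in the per-model `export.py` files.

```python
from NNDF.models import ModelFileConverter, ONNXModelFile, TorchModelFile, TRTEngineFile
from NNDF.networks import NetworkMetadata, Precision

class MyEngineFile(TRTEngineFile):
    # Hypothetical engine-file class; real ones may also override get_network_definition().
    def use_obey_precision_constraints(self):
        return False

class MyConverter(ModelFileConverter):
    # ONNX -> TRT reuses the default onnx_to_trt() implemented in ModelFileConverter.
    def __init__(self):
        super().__init__(ONNXModelFile, TorchModelFile, MyEngineFile)

# `other` would normally carry model-specific flags such as kv_cache.
metadata = NetworkMetadata(variant="t5-small", precision=Precision(fp16=True), other=None)

# Wrap an ONNX file that already exists on disk and build an engine from it.
onnx_model = ONNXModelFile("workspace/model.onnx", MyConverter, metadata)
engine = onnx_model.as_trt_engine("workspace/model.onnx.engine", profiles=[], preview_features=[])
```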
diff --git a/demo/HuggingFace/NNDF/networks.py b/demo/HuggingFace/NNDF/networks.py
deleted file mode 100644
index ff8700fc..00000000
--- a/demo/HuggingFace/NNDF/networks.py
+++ /dev/null
@@ -1,225 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Helpers for abstracting high-level network concepts. Different from 'models.py' which deals
-with IO abstraction.
-"""
-
-import string
-
-from typing import Dict, Union, Tuple
-from collections import namedtuple, OrderedDict
-
-# externals
-# None. Should not have any external dependencies.
-
-FILENAME_VALID_CHARS = "-~_.() {}{}".format(string.ascii_letters, string.digits)
-
-"""NetworkResult(input: str, output_tensor: np.array, semantic_output: np.array, median_runtime: NetworkRuntime, models: [str])"""
-NetworkResult = namedtuple(
- "NetworkResult",
- ["input", "output_tensor", "semantic_output", "median_runtime", "models"],
-)
-
-"""BenchmarkingResult(median_runtime: NetworkRuntime, models: [str])"""
-BenchmarkingResult = namedtuple(
- "BenchmarkingResult",
- ["median_runtime", "models"],
-)
-
-"""CheckpointResult(network_results: List[NetworkResult], accuracy: float, perplexity: float)"""
-NetworkCheckpointResult = namedtuple(
- "NetworkCheckpointResult", ["network_results", "accuracy", "perplexity"]
-)
-
-# Tracks TRT Precision Config
-"""Precision(fp16: Bool)"""
-Precision = namedtuple("Precision", ["fp16"])
-
-"""NetworkMetadata(variant: str, precision: Precision, other: Union[namedtuple, None])"""
-NetworkMetadata = namedtuple("NetworkMetadata", ["variant", "precision", "other"])
-
-"""TimingProfile(iterations: int, number: int, warmup: int, duration: int, percentile: int or [int])"""
-TimingProfile = namedtuple("TimingProfile", ["iterations", "number", "warmup", "duration", "percentile"])
-
-
-"""NetworkModel(name: str, fpath: str)"""
-NetworkModel = namedtuple("NetworkModel", ["name", "fpath"])
-
-"""
-String encodings of generated network models.
- NetworkModels(torch: Tuple[NetworkModel], onnx: Tuple[NetworkModel])
-"""
-NetworkModels = namedtuple("NetworkModels", ["torch", "onnx", "trt"])
-
-"""
-Args:
- name: Name of the network / parts of the network timed.
- runtime: Measured runtime of the named segment.
-
-NetworkRuntime(name: str, runtime: float)
-"""
-NetworkRuntime = namedtuple("NetworkRuntime", ["name", "runtime"])
-
-class Dims:
- """Helper class for interfacing dimension constructs with Polygraphy and PyTorch."""
-
- BATCH = "batch"
- SEQUENCE = "sequence"
-
- def __init__(self, encoding: OrderedDict):
- self.encoding = encoding
-
- @staticmethod
- def create_new_sequence_dim(dim_type: str) -> str:
- """
- Returns a new sequence dimension.
-
- Return:
- str: Returns a sequence dimension which Dims.SEQUENCE appended by dim_type.
- """
- return Dims.SEQUENCE + "_" + dim_type
-
- def get_dims(self):
- """
- Returns the encoding dimensions.
-
- Return:
- OrderedDict[str, Union[int, str]]: Returns dimensional encoding. Example: {'input_ids': (1, SEQUENCE_DIM)}
- """
- return self.encoding
-
- def get_names(self) -> Tuple[str]:
- return tuple(self.encoding.keys())
-
- def get_lengths(self) -> Tuple[Union[int, str]]:
- return tuple(self.encoding.values())
-
- def get_torch_dynamic_axis_encoding(self) -> dict:
- """
- Returns a Pytorch "dynamic_axes" encoding for onnx.export.
-
- Returns:
- dict: Returns a 'dynamic' index with corresponding names according to:
- https://pytorch.org/docs/stable/onnx.html
- """
-
- dynamic_axes = {}
- for k, v in self.encoding.items():
- encodings = []
- for idx, e in enumerate(v):
- if isinstance(e, str) and (e == self.BATCH or self.SEQUENCE in e):
- encodings.append((idx, e))
- dynamic_axes[k] = {idx: e for idx, e in encodings}
-
- return dynamic_axes
-
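A small worked example of how a `Dims` encoding maps onto the `dynamic_axes` argument of `torch.onnx.export`; the values in the comments are what the calls above produce.

```python
from collections import OrderedDict

# One input tensor with a dynamic batch and a dynamic sequence dimension.
dims = Dims(OrderedDict({"input_ids": (Dims.BATCH, Dims.SEQUENCE)}))

print(dims.get_names())                        # ('input_ids',)
print(dims.get_torch_dynamic_axis_encoding())  # {'input_ids': {0: 'batch', 1: 'sequence'}}
```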
-# Config Class
-class NNConfig:
- """Contains info for a given network that we support."""
-
- NETWORK_SEGMENTS = ["full"]
-
- def __init__(self, network_name, variants=None):
- assert self._is_valid_filename(
- network_name
- ), "Network name: {} is not filename friendly.".format(network_name)
-
- self.network_name = network_name
- self.variants = variants
-
- # Due to limitations of namedtuples and the pickle module, the namedtuple class must be tracked
- # as an instance attribute which refers to a global.
- if len(self.variants) > 0:
- self.MetadataClass = type(self.variants[0].other)
- else:
- self.MetadataClass = None
-
- def get_network_segments(self):
- """
- Returns exportable segments for the given network.
- Used in the case where a single network needs to
- be exported into multiple parts.
- """
- return self.NETWORK_SEGMENTS
-
- @staticmethod
- def get_output_dims(metadata) -> Dict:
- """
- Returns the output dimensions of the current network.
- Since some networks can have multiple parts, should be a dictionary encoding.
-
- Returns:
- (Dict): {"network_section": Dims}
- """
- raise NotImplementedError("Output dims not yet defined.")
-
- @staticmethod
- def get_input_dims(metadata) -> Dict:
- """
- Returns the input dimensions of the current network.
- Since some networks can have multiple parts, should be a dictionary encoding.
-
- Returns:
- (Dict): {"network_section": Dims} example:
- {"encoder": Dims(...), "decoder": Dims(...)}
- """
- raise NotImplementedError("Input dims not yet defined.")
-
- def _is_valid_filename(self, filename: str) -> bool:
- """
- Checks if a given filename is valid, helpful for cross platform dependencies.
- """
- return all(c in FILENAME_VALID_CHARS for c in filename)
-
- def get_python_requirements(self):
- return []
-
- def get_metadata_string(self, metadata: NetworkMetadata) -> str:
- """
- Serializes a NetworkMetadata object into a string.
- The string is checked to be filename friendly across Windows and Linux operating systems.
-
- Returns:
- str: Serialized metadata with components joined by "-".
- """
-
- precision_str = "-".join(
- [k for k, v in metadata.precision._asdict().items() if v]
- )
- result = [self.network_name, metadata.variant]
- if precision_str:
- result.append(precision_str)
-
- other_result = [
- "{}~{}".format(k, str(v)) for k, v in metadata.other._asdict().items()
- ]
- # Remove all boolean values that are False and remove True if exists
- true_length = len("~True")
- other_result_filtered = [v[:-true_length] if v.endswith("~True") else v for v in other_result if "~False" not in v]
-
- if len(other_result_filtered) != 0:
- result.append("-".join(other_result_filtered))
-
- final_str = "-".join(result)
- assert self._is_valid_filename(
- final_str
- ), "Metadata for current network {} is not filename friendly: {}.".format(
- self.network_name, final_str
- )
-
- return final_str
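As a concrete illustration of the serialization above, a metadata tuple with FP16 enabled and a kv-cache flag yields a filename-friendly string. The `KV` payload here is an illustrative stand-in; real models define their own `other` namedtuple.

```python
from collections import namedtuple

KV = namedtuple("KV", ["kv_cache"])  # illustrative model-specific payload

metadata = NetworkMetadata(variant="t5-small", precision=Precision(fp16=True), other=KV(kv_cache=True))
config = NNConfig("T5", variants=[metadata])

print(config.get_metadata_string(metadata))  # "T5-t5-small-fp16-kv_cache"
```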
diff --git a/demo/HuggingFace/NNDF/tensorrt_utils.py b/demo/HuggingFace/NNDF/tensorrt_utils.py
deleted file mode 100644
index 74226ae8..00000000
--- a/demo/HuggingFace/NNDF/tensorrt_utils.py
+++ /dev/null
@@ -1,316 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""Utilities related to Polygraphy"""
-
-from typing import Dict, List
-from functools import reduce
-from enum import Enum
-
-# polygraphy
-from polygraphy.backend.trt import engine_from_bytes, TrtRunner
-from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
-from polygraphy.backend.common import bytes_from_path
-from polygraphy.logger import G_LOGGER as PG_LOGGER
-
-# tensorrt
-import tensorrt as trt
-import os
-
-# ONNX
-import onnx
-import onnx_graphsurgeon as gs
-
-# numpy
-import numpy as np
-
-# NNDF
-from NNDF.networks import NetworkMetadata
-from NNDF.models import TRTEngineFile
-from NNDF.logger import G_LOGGER
-
-# PyTorch
-import torch
-
-# Helper Functions
-def setup_benchmark_arg(user_input, name, default):
- '''
- Set up benchmarking arguments for trt
- '''
- if user_input is None:
- G_LOGGER.warning("{} is not provided, default to {}".format(name, default))
- return default
- return user_input
-
-def allocate_binding_buffer(types_dict, shapes_dict):
- '''
- Allocate binding buffers for trt based on provided types and shapes dict
- '''
- return {
- k: torch.zeros(reduce(lambda v, a: v*a, shape), dtype=types_dict[k]).cuda()
- for k, shape in shapes_dict.items()
- }
-
-
-def set_kv_data(kv_dict, past_or_present, layer_id, segment_value_dict):
- '''
- Set the types and shapes dict for kv-cache based on the provided inputs:
- kv_dict: Dict[str, tuple/torch.dtype], the dict to modify within the function
- past_or_present: str, either "past" or "present"
- layer_id: int, need kv cache for each decoder layer
- segment_value_dict: Dict[str, tuple/torch.dtype], example:
- kvcache type: {"encoder": torch.float32, "decoder": torch.float32}
- kvcache shape: {"encoder": cross_attention_kv_shape, "decoder": self_attention_kv_shape}
- '''
- for segment, value in segment_value_dict.items():
- for code in ['key', 'value']:
- kv_dict[f"{past_or_present}_key_values.{layer_id}.{segment}.{code}"] = value
-
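A short sketch of how these two helpers combine to allocate kv-cache device buffers. The layer count, shapes, and dtypes below are illustrative, and a CUDA device is required because the buffers are allocated on the GPU.

```python
import torch

shapes, dtypes = {}, {}
for layer_id in range(2):  # a hypothetical 2-layer decoder
    # (batch, num_heads, past_length, embedding_size_per_head)
    set_kv_data(shapes, "past", layer_id, {"decoder": (1, 8, 4, 64), "encoder": (1, 8, 4, 64)})
    set_kv_data(dtypes, "past", layer_id, {"decoder": torch.float16, "encoder": torch.float16})

buffers = allocate_binding_buffer(dtypes, shapes)
print(sorted(buffers)[:2])  # ['past_key_values.0.decoder.key', 'past_key_values.0.decoder.value']
```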
-def clamp_weights_onnx(graph, min: float, max: float, ignore_nodes: List = None):
- """
- Clamps given onnx model to targeted upper and lower bounds.
- """
-
- if ignore_nodes is None:
- ignore_nodes = {}
- else:
- ignore_nodes = {k: True for k in ignore_nodes}
-
- for tensor in graph.tensors().values():
- if tensor.name in ignore_nodes or isinstance(tensor, gs.ir.tensor.Variable):
- continue
-
- np.clip(tensor.values, min, max, out=tensor.values)
-
- for tensor in graph.nodes:
- node_attr = tensor.attrs.get("value", None)
- if tensor.name in ignore_nodes:
- continue
-
- if node_attr is not None:
- np.clip(node_attr.values, min, max, out=node_attr.values)
-
- return graph
-
-
-def clamp_weights_onnx_to_fp16_bounds(graph, ignore_nodes: List = None):
- upper_bound = 65504
- return clamp_weights_onnx(graph, -upper_bound, upper_bound, ignore_nodes)
-
-
-def move_t5_cast_op(graph):
- """
- T5 encoder and decoder have cast ops after residual add operation.
- Moving the cast operation before add helps with FP16 accuracy as addition operation
- can cause overflow in FP16.
- """
-
- cast_nodes = [node for node in graph.nodes if node.op == "Cast"]
- # Version check for backward compatibility
- torch_version_major = int(torch.__version__.split('.')[0])
- torch_version_minor = int(torch.__version__.split('.')[1])
- version_check = torch_version_major == 1 and torch_version_minor > 12
- for n in cast_nodes:
- # Cast appears at the output of add and feeds into a Pow op.
- if n.i().op == "Add":
- found_pow = False
- for o in n.outputs:
- for o1 in o.outputs:
- if o1.op == "Pow":
- found_pow = True
-
- if found_pow:
- if version_check:
- # Using Clip would be the simplest way, but unfortunately TRT refuses to put "Clip" on Myelin. The WAR
- # is to insert a Max followed by a Min instead.
- # Replace the Cast with Max + Min
- n.op = "Max"
- n.name = n.name.replace("Cast", "Max")
- n.attrs = {}
- lower_bound = gs.Constant(n.name + "/lower_bound", np.array(-64000.0, dtype=np.float32))
- n.inputs = [n.inputs[0], lower_bound]
-
- max_node_output = n.outputs[0]
- # Max has already exist, avoid tensors with same names
- max_node_output.name = max_node_output.name.replace("Cast", "ClipMax")
-
- upper_bound = gs.Constant(n.name + "/upper_bound", np.array(64000.0, dtype=np.float32))
- min_node_inputs = [max_node_output, upper_bound]
-
- min_node_output = gs.Variable(max_node_output.name.replace("ClipMax", "ClipMin"), dtype = np.float32)
- min_node = gs.Node(op="Min", inputs = min_node_inputs, outputs = [min_node_output], attrs = {})
- graph.nodes.append(min_node)
-
- for o in max_node_output.outputs:
- # To avoid loop in graph
- if o.op != "Min":
- o.inputs = [min_node_output if i == max_node_output else i for i in o.inputs]
- else:
- n.i().outputs = n.outputs
- n.outputs.clear()
-
- graph.cleanup().toposort()
-
- add_nodes = [node for node in graph.nodes if node.op == "Add"]
- for n in add_nodes:
- if (version_check and (n.o().o().o().op == "Pow")) or ((not version_check) and (n.o().op == "Pow")):
- add_inputs = n.inputs
- outs = []
- for i in add_inputs:
- identity_out = gs.Variable("identity_out" + i.name, dtype=np.float32)
- new_cast = gs.Node(op="Cast", inputs=[i], outputs=[identity_out], attrs={"to": 1})
- outs.append(identity_out)
- graph.nodes.append(new_cast)
- n.inputs = outs
-
- graph.cleanup().toposort()
- return graph
-
-# These operations are applied together in a single pass, since running them separately would require loading/unloading the ONNX files twice.
-class OnnxProcessOperation(Enum):
- CLAMP_WEIGHTS = 1
- MOVE_CAST_OP = 2
-
-def process_onnx(config: List[OnnxProcessOperation], onnx_input_fpath, onnx_output_fpath, keep_input = False, **kwargs):
- graph = gs.import_onnx(onnx.load(onnx_input_fpath))
- folder = os.path.split(onnx_input_fpath)[0]
- for op in config:
- if op == OnnxProcessOperation.CLAMP_WEIGHTS:
- graph = clamp_weights_onnx_to_fp16_bounds(graph, **kwargs)
- elif op == OnnxProcessOperation.MOVE_CAST_OP:
- graph = move_t5_cast_op(graph)
-
- model = gs.export_onnx(graph)
- model_size = 0
- for filename in os.listdir(folder):
- file_path = os.path.join(folder, filename)
- try:
- if os.path.isfile(file_path) or os.path.islink(file_path):
- model_size += os.stat(file_path).st_size
- if not keep_input:
- os.unlink(file_path)
-
- except Exception as e:
- print('Failed to delete %s. Reason: %s' % (file_path, e))
-
- # Save the weights as external data only when model > 2GB
- if model_size >= 1.8 * 1024 * 1024 * 1024:
- onnx.save_model(model, onnx_output_fpath, save_as_external_data=True, all_tensors_to_one_file = False, convert_attribute=False)
- else:
- onnx.save_model(model, onnx_output_fpath, save_as_external_data=False)
-
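An illustrative invocation of the pass above, clamping FP16-overflowing weights and moving the T5 residual `Cast` ops in one load/save cycle; the paths are placeholders.

```python
process_onnx(
    [OnnxProcessOperation.MOVE_CAST_OP, OnnxProcessOperation.CLAMP_WEIGHTS],
    "workspace/t5_decoder.onnx",        # input ONNX file (removed afterwards unless keep_input=True)
    "workspace/t5_decoder_fp16.onnx",   # output ONNX file
    keep_input=False,
)
```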
-# Helper Classes
-class TRTNativeRunner:
- """TRTNativeRunner avoids the high overheads with Polygraphy runner providing performance comparable to C++ implementation."""
- def __init__(self, trt_engine_file: TRTEngineFile, network_metadata: NetworkMetadata):
- self.network_metadata = network_metadata
- self.trt_engine_file = trt_engine_file
- self.trt_logger = trt.Logger()
-
- if G_LOGGER.level == G_LOGGER.DEBUG:
- self.trt_logger.min_severity = trt.Logger.VERBOSE
- elif G_LOGGER.level == G_LOGGER.INFO:
- self.trt_logger.min_severity = trt.Logger.INFO
- else:
- self.trt_logger.min_severity = trt.Logger.WARNING
-
- G_LOGGER.info("Reading and loading engine file {} using trt native runner.".format(self.trt_engine_file.fpath))
- with open(self.trt_engine_file.fpath, "rb") as f:
- self.trt_runtime = trt.Runtime(self.trt_logger)
- self.trt_engine = self.trt_runtime.deserialize_cuda_engine(f.read())
- self.trt_context = self.trt_engine.create_execution_context()
-
- # By default set optimization profile to 0
- self.profile_idx = 0
-
- # Other metadata required by the profile
- self._num_bindings_per_profile = self.trt_engine.num_bindings // self.trt_engine.num_optimization_profiles
- G_LOGGER.debug("Number of profiles detected in engine: {}".format(self._num_bindings_per_profile))
-
- def release(self):
- pass
-
- def get_optimization_profile(self, batch_size, sequence_length):
- """Provided helper function to obtain a profile optimization."""
- # Select an optimization profile
- # inspired by demo/BERT/inference.py script
- selected_profile_idx = None
- for idx in range(self.trt_engine.num_optimization_profiles):
- profile_shape = self.trt_engine.get_profile_shape(profile_index=idx, binding=idx * self._num_bindings_per_profile)
-
- if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size \
- and profile_shape[0][1] <= sequence_length and profile_shape[2][1] >= sequence_length:
- G_LOGGER.debug("Selected profile: {}".format(profile_shape))
- selected_profile_idx = idx
- break
-
- if selected_profile_idx is None:
- raise RuntimeError("Could not find any profile that matches batch_size={}, sequence_length={}".format(batch_size, sequence_length))
-
- return selected_profile_idx
-
- def __call__(self, *args, **kwargs):
- self.trt_context.active_optimization_profile = self.profile_idx
- return self.forward(*args, **kwargs)
-
-class PolygraphyOnnxRunner:
- def __init__(self, onnx_fpath: str, network_metadata: NetworkMetadata):
- self.network_metadata = network_metadata
- self.trt_session = SessionFromOnnx(onnx_fpath)
- self.trt_context = OnnxrtRunner(self.trt_session)
- self.trt_context.activate()
-
- def __call__(self, *args, **kwargs):
- # hook polygraphy verbosity for inference
- g_logger_verbosity = (
- G_LOGGER.EXTRA_VERBOSE
- if G_LOGGER.root.level == G_LOGGER.DEBUG
- else G_LOGGER.WARNING
- )
- with PG_LOGGER.verbosity(g_logger_verbosity):
- return self.forward(*args, **kwargs)
-
- def release(self):
- self.trt_context.deactivate()
-
-class TRTPolygraphyRunner:
- """
- TRT network interface, implemented with Polygraphy, that can be used to measure inference time.
- Easier to use but slower; TRTNativeRunner is recommended for better performance.
- """
-
- def __init__(self, engine_fpath: str, network_metadata: NetworkMetadata):
- self.network_metadata = network_metadata
-
- self.trt_engine = engine_from_bytes(bytes_from_path(engine_fpath))
- self.trt_context = TrtRunner(self.trt_engine.create_execution_context())
- self.trt_context.activate()
-
- def __call__(self, *args, **kwargs):
- # hook polygraphy verbosity for inference
- g_logger_verbosity = (
- G_LOGGER.EXTRA_VERBOSE
- if G_LOGGER.root.level == G_LOGGER.DEBUG
- else G_LOGGER.WARNING
- )
-
- with PG_LOGGER.verbosity(g_logger_verbosity):
- return self.forward(*args, **kwargs)
-
- def release(self):
- self.trt_context.deactivate()
diff --git a/demo/HuggingFace/NNDF/torch_utils.py b/demo/HuggingFace/NNDF/torch_utils.py
deleted file mode 100644
index f3b2fadc..00000000
--- a/demo/HuggingFace/NNDF/torch_utils.py
+++ /dev/null
@@ -1,96 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""Torch utils used by demo folder."""
-
-import inspect
-from typing import Callable
-
-# pytorch
-import torch
-
-# NNDF
-from NNDF.logger import G_LOGGER
-
-# Function Decorators #
-def use_cuda(func: Callable):
- """
- Tries to send all parameters of a given function to the CUDA device if the system supports it.
- Each object must provide a "to(device: str)" method that maps it to the target device "cuda";
- this relies on the standard torch implementation.
-
- Wrapped functions must have a keyword argument "use_cuda: bool" which enables
- or disables the CUDA toggling.
- """
-
- def _send_args_to_device(caller_kwargs, device):
- new_kwargs = {}
- for k, v in caller_kwargs.items():
- if getattr(v, "to", False):
- new_kwargs[k] = v.to(device)
- else:
- new_kwargs[k] = v
- return new_kwargs
-
- def wrapper(*args, **kwargs):
- caller_kwargs = inspect.getcallargs(func, *args, **kwargs)
- assert (
- "use_cuda" in caller_kwargs
- ), "Function must have 'use_cuda' as a parameter."
-
- if caller_kwargs["use_cuda"]:
- new_kwargs = {}
- used_cuda = False
- if torch.cuda.is_available() and caller_kwargs["use_cuda"]:
- new_kwargs = _send_args_to_device(caller_kwargs, "cuda")
- used_cuda = True
- else:
- new_kwargs = _send_args_to_device(caller_kwargs, "cpu")
-
- try:
- return func(**new_kwargs)
- except RuntimeError as e:
- # If a device has cuda installed but no compatible kernels, cuda.is_available() will still return True.
- # This exception is necessary to catch remaining incompat errors.
- if used_cuda:
- G_LOGGER.warning("Unable to execute program using cuda compatible device: {}".format(e))
- G_LOGGER.warning("Retrying using CPU only.")
- new_kwargs = _send_args_to_device(caller_kwargs, "cpu")
- new_kwargs["use_cuda"] = False
- cpu_result = func(**new_kwargs)
- G_LOGGER.warning("Successfully obtained result using CPU.")
- return cpu_result
- else:
- raise e
- else:
- return func(**caller_kwargs)
-
- return wrapper
-
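A minimal usage sketch of the decorator, assuming a hypothetical `scale` function; tensor arguments are moved to the GPU when one is usable and fall back to the CPU otherwise.

```python
import torch

@use_cuda
def scale(t: torch.Tensor, factor: float = 2.0, use_cuda: bool = True):
    # `t` has already been moved to the selected device by the decorator.
    return t * factor

result = scale(torch.ones(3), use_cuda=True)
```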
-def expand_inputs_for_beam_search(
- tensor,
- expand_size: int = 1,
-):
- """
- Interleaves the input tensor `expand_size` times along the batch dimension, similar to HuggingFace's _expand_inputs_for_generation() in generation_utils.py
- """
- expanded_return_idx = (
- torch.arange(tensor.shape[0]).view(-1, 1).repeat(1, expand_size).view(-1)
- )
- tensor = tensor.index_select(0, expanded_return_idx.to(tensor.device))
-
- return tensor
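For example, interleaving a batch of two sequences for a beam width of 3 repeats each row three times along the batch dimension.

```python
import torch

ids = torch.tensor([[11, 12], [21, 22]])  # batch of 2 sequences
expanded = expand_inputs_for_beam_search(ids, expand_size=3)
print(expanded.shape)  # torch.Size([6, 2]); rows 0-2 repeat the first sequence
```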
diff --git a/demo/HuggingFace/README.md b/demo/HuggingFace/README.md
deleted file mode 100644
index cbc8e9c2..00000000
--- a/demo/HuggingFace/README.md
+++ /dev/null
@@ -1,195 +0,0 @@
-# TensorRT Inference for HuggingFace Transformers 🤗
-
-This repository demonstrates TensorRT inference with models developed using [HuggingFace Transformers](https://huggingface.co/transformers/).
-
-Currently, this repository supports the following models:
-
-1. [GPT2 (text generation task)](https://huggingface.co/transformers/model_doc/gpt2.html). The sample supports following variants of GPT2:
-
- gpt2 (117M), gpt2-medium (345M), gpt2-large (774M), gpt2-xl (1558M), EleutherAI/gpt-j-6B (6053M)
-
-2. [T5 (translation, premise task)](https://huggingface.co/transformers/model_doc/t5.html). The sample supports following variants of T5:
-
- t5-small (60M), t5-base (220M), t5-large (770M), t5-3b(3B), t5-11b(11B)
-
-3. [BART (summarization task)](https://huggingface.co/docs/transformers/model_doc/bart.html). The sample supports the following variants of BART:
-
- facebook/bart-base (139M), facebook/bart-large (406M), facebook/bart-large-cnn (406M), facebook/mbart-large-50 (680M)
-
-## Setup
-
-
-Follow the setup steps in the TensorRT OSS repository. It is recommended to experiment inside a Docker container.
-For the smoothest setup experience, use [Poetry](https://python-poetry.org/) to install the requirements and execute:
-
-```bash
-poetry install # one-time setup
-poetry add # see top level repo README.md on how to get TensorRT wheels.
-poetry run python run.py # execute program
-```
-
-However, a requirements.txt is also provided:
-
-```bash
-pip3 install -r requirements.txt # install requirements
-python run.py # execute program
-```
-
-**Please note that Python <= 3.6 has reached end-of-life and is no longer supported.**
-
-## File Structure
-
-```bash
-.
-├── GPT2 # GPT2 directory
-│ └── ...
-├── T5 # T5 directory
-│ └── ...
-├── BART # BART directory
-│ ├── BartModelConfig.py # Model configuration and variant-specific parameters
-│ ├── checkpoint.toml # Example inputs and baseline outputs
-│ ├── export.py # Model conversions between Torch, TRT, ONNX
-│ ├── frameworks.py # PyTorch inference script
-│ ├── onnxrt.py # OnnxRT inference script
-│ ├── trt.py # TensorRT inference script
-│ ├── hf.py # HuggingFace inference script
-│ └── measurements.py # Performance measurement script
-├── NNDF # common high-level abstraction of classes and utilities
-├── notebooks # Jupyter notebooks for GPT2 and T5
-└── run.py # main entry script
-```
-
-## How to run comparison script
-
-`run.py` is the main entry point for the demos. `compare` and `run` are the two most common actions to use with `run.py`.
-
-The `compare` action will by default compare all implemented frameworks, e.g., PyTorch framework & TRT (for GPT2), and PyTorch framework & TRT & OnnxRT (for T5 and BART).
-
-```bash
-python3 run.py compare GPT2 --variant [gpt2 | gpt2-medium | gpt2-large | gpt2-xl | EleutherAI/gpt-j-6B] --working-dir temp
-```
-
-The above script compares the performance of PyTorch framework inference and TensorRT inference for GPT2:
-
-| script | accuracy | decoder (sec) | encoder (sec) | full (sec) |
-|------------|----------|---------------|---------------|------------|
-| frameworks | 1 | 0.0292865 | 0.0174382 | 0.122532 |
-| trt | 1 | 0.00494083 | 0.0068982 | 0.0239782 |
-
-Notes: `--variant` designates the pre-trained model to test. `--working-dir` specifies where downloaded pre-trained models, ONNX model files, and TRT engine files are saved. An accuracy of 1.0 indicates results consistent with the expected outputs in `checkpoint.toml`. By default, all reported running times are medians over 10 iterations.
-
-## How to run functional and performance benchmark
-
-The `run` action will run the specific script under the model directory.
-
-```bash
-python3 run.py run GPT2 [frameworks | trt] --variant [gpt2 | gpt2-medium | gpt2-large | gpt2-xl | EleutherAI/gpt-j-6B] --working-dir temp
-```
-
-Expected output:
-
-```properties
-NetworkCheckpointResult(network_results=[NetworkResult(
-input='TensorRT is a Deep Learning compiler used for deep learning.\n',
-output_tensor=tensor([ 51, 22854, ....], device='cuda:0'),
-semantic_output=['TensorRT is a Deep Learning compiler used for deep learning.\n\nThe main goal of the project is to create a tool that can be used to train deep learning algorithms.\n\n'],
-median_runtime=[NetworkRuntime(name='gpt2_decoder', runtime=0.002254825085401535), NetworkRuntime(name='full', runtime=0.10705459117889404)],
-models=NetworkModels(torch=None, onnx=[NetworkModel(name='gpt2_decoder', fpath='temp/GPT2/GPT2-gpt2-fp16.onnx')],
-trt=[NetworkModel(name='gpt2_decoder', fpath='temp/GPT2/GPT2-gpt2-fp16.onnx.engine')]))], accuracy=1.0)
-```
-
-## How to run with different precisions in TensorRT
-
-Frameworks (PyTorch) by default run TF32 on Ampere devices and fall back to FP32 on pre-Ampere devices. Accordingly, TF32 is also the default precision for TensorRT runs. To experiment with different precisions, use `--fp16` for FP16:
-
-```bash
-python3 run.py run BART trt --variant facebook/bart-base --working-dir temp [--fp16]
-```
-
-## How to customize parameters for time measurement
-Use `--iterations`, `--number`, `--warmup`, `--duration`, and `--percentile` to control the time measurement process. The most common parameters are explained below:
-* `--iterations`: number of iterations to measure (default 10)
-* `--warmup`: number of warmup iterations before the actual measurement (default 3)
-* `--percentile`: percentile to report (default 50, i.e. the median).
-
-```bash
-python3 run.py run BART trt --variant facebook/bart-base --working-dir temp --iterations 100 --percentile 99
-```
-
-Notes:
-* Percentile numbers are representative only if the number of iterations is sufficiently large. Please consider increasing `--iterations` when combined with `--percentile`.
-* To avoid conflict with the overall result printing structure, only one percentile number is allowed from the command line. If you need multiple timing statistics from one run (such as p50, p90, p99), either (1) run the command multiple times while changing `--percentile` -- engines are not re-built from run to run, so this is still efficient -- or (2) use the [Jupyter notebook demo](./notebooks) for more flexible measurement that can report all percentiles in one run.
-
-## How to run with K-V cache
-
-For all the models (GPT2/BART/T5), use the `--enable-kv-cache` option to get the same effect as HuggingFace's `use_cache` option. For encoder-decoder models, this option will use the key & value cache in the decoder for uni-directional self-attention and encoder-decoder cross-attention. The KV cache can reduce the size of `input_ids` and improve runtime performance when `input_ids` is long. Current benchmarking results show that at `input_seq_len = 1024` and `output_seq_len = 1024`, the t5-large model with kv cache can be about 3x faster than without kv cache on a single NVIDIA A100 GPU.
-
-```bash
-python3 run.py run BART [frameworks | trt] --variant facebook/bart-base --working-dir temp --enable-kv-cache
-```
-
-Notes:
-* For T5, the code has been optimized according to the latest TensorRT features. (1) Cross-attention kv does not change throughout the decoding session, so it is only calculated once at the first decoding step. `onnx.export` cannot handle this logic properly for HuggingFace, so we create a "cross-attention kv generator" that uses only `encoder_hidden_states`. (2) TensorRT's "zero tensor" feature is used so the self-attention kv cache can grow from empty. (3) Self-attention input and output share the same memory location to avoid a D2D copy for the kv cache. A similar optimization will be ported to BART.
-
-* For BART, we will be porting a similar optimization from T5, but currently the K-V cache decoder with TensorRT requires exporting 2 ONNX files and building separate engines, called "non-kv" and "kv" respectively. For the first decoder run, the KV cache is generated from only `input_ids` and `encoder_hidden_states` (if encoder-decoder), which is named "non-kv". For the other decoder iterations, the previous KV cache and other inputs are passed into the model to generate the updated KV cache and decoder_hidden_states, which is named "kv". Because the current ONNX export cannot handle a dynamic number of inputs, 2 ONNX files with slightly different configurations are used together.
-
-* For GPT2, since it is decoder only, only the self-attention kv cache is needed, and it has 2 modes, corresponding to 2 optimization profiles for a single TensorRT engine: context mode, which takes in `input_ids` of variable length only and outputs `hidden_states` and the self-attention cache; and generation mode, which takes in `input_ids` with seq_len = 1 and the entire self-attention kv cache, and outputs `hidden_states` with seq_len = 1 and the kv cache grown to cum_seq_len (`past_decoder_length`) + 1. A memory concurrency issue prevents the self-attention input and output from pointing to the same memory location, so it requires a dual cache.
-
-## How to run with beam search
-
-In addition to greedy search, beam search is another widely used decoding method. For all the models, use `--num-beams` to enable beam search during decoding.
-
-```bash
-python3 run.py run BART [frameworks | trt] --variant facebook/bart-base --working-dir temp --num-beams 3
-```
-
-Notes:
-* K-V cache with beam search has memory concurrency issues with TensorRT optimization. We are currently working on this issue.
-
-
-## How to run without the TensorRT `FASTER_DYNAMIC_SHAPES_0805` preview feature
-
-`FASTER_DYNAMIC_SHAPES_0805` significantly improves TensorRT engine build time and is enabled by default in TRT 8.6+. In rare cases the runtime may increase, so the `--disable-preview-dynamic-shapes` option is provided to disable this preview feature for BART, GPT2, and T5:
-
-```bash
-python3 run.py run BART trt --variant facebook/bart-base --working-dir temp --disable-preview-dynamic-shapes
-```
-
-Notes:
-* The preview argument applies only to TensorRT runs. Hence, avoid using the `compare` action with `--disable-preview-dynamic-shapes`, since the flag does not exist for `frameworks` and `onnxrt` runs. Instead, run the TensorRT `run` command separately to obtain the performance without this preview feature.
-
-## How to run in performance benchmarking mode
-
-The `benchmark` action will benchmark the specific script under the model directory using random input data with specified input/output sequence lengths. Note that since the input data is random, the accuracy is not guaranteed, but the benchmarking mode is useful for performance measurement since it allows arbitrary and controllable input/output sequence lengths with early stopping being disabled and allows apples-to-apples performance comparisons across different frameworks.
-
-```bash
-python3 run.py benchmark GPT2 [frameworks | trt] --variant [gpt2 | gpt2-medium | gpt2-large | gpt2-xl | EleutherAI/gpt-j-6B] --working-dir temp --input-seq-len 128 --output-seq-len 256
-```
-
-## Testing
-
-```bash
-pytest
-```
-
-It is recommended to use pytest `4.6.x`. Your Python environment must already have the setup completed.
-
-
-## Troubleshooting
-
-### cuBLAS Errors
-
-```
-CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
-```
-
-It is possible that your LD_LIBRARY_PATH has a competing CUDA version stored inside, causing PyTorch to read the incorrect library.
-Consider modifying LD_LIBRARY_PATH and removing your CUDA path.
diff --git a/demo/HuggingFace/T5/T5ModelConfig.py b/demo/HuggingFace/T5/T5ModelConfig.py
deleted file mode 100644
index 5490fb4b..00000000
--- a/demo/HuggingFace/T5/T5ModelConfig.py
+++ /dev/null
@@ -1,293 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import argparse
-
-from collections import namedtuple, OrderedDict
-from itertools import product
-from typing import Dict
-
-# TRT-HuggingFace
-from NNDF.networks import Precision, NetworkMetadata, NNConfig, Dims
-from NNDF.interface import MetadataArgparseInteropMixin
-
-# A limitation of namedtuples: they must be declared at module scope and not inside classes,
-# otherwise pickle does not work.
-# See: https://stackoverflow.com/questions/4677012/python-cant-pickle-type-x-attribute-lookup-failed
-_T5Metadata = namedtuple("T5Metadata", ["kv_cache"])
-
-
-class T5Metadata(_T5Metadata, MetadataArgparseInteropMixin):
- @staticmethod
- def add_args(parser: argparse.ArgumentParser) -> None:
- """Add commandline interface parser."""
- network_group = parser.add_argument_group("T5 network")
- network_group.add_argument(
- "--variant",
- help="T5 variant to generate",
- choices=T5ModelTRTConfig.TARGET_MODELS,
- required=True,
- )
- network_group.add_argument(
- "--enable-kv-cache",
- help="T5 enable KV cache",
- action="store_true",
- default=False,
- )
- network_group.add_argument(
- "--num-beams", type=int, default=1, help="Enables beam search during decoding."
- )
-
- network_group.add_argument(
- "--fp16", action="store_true", help="Enables fp16 TensorRT tactics."
- )
-
- @staticmethod
- def from_args(args: argparse.Namespace):
- return NetworkMetadata(
- variant=args.variant,
- precision=Precision(fp16=args.fp16),
- other=T5Metadata(kv_cache=args.enable_kv_cache),
- )
-
- @staticmethod
- def add_benchmarking_args(parser: argparse.ArgumentParser) -> None:
- benchmarking_group = parser.add_argument_group("benchmarking group")
- benchmarking_group.add_argument(
- "--input-seq-len",
- type=int,
- help="Specify fixed input sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
- )
- benchmarking_group.add_argument(
- "--output-seq-len",
- type=int,
- help="Specify fixed output sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
- )
-
-T5BenchmarkingArgs = namedtuple("T5BenchmarkingArgs", ["input_seq_len", "output_seq_len"])
-
-# trt has more benchmarking arguments
-T5TRTBenchmarkingArgs = namedtuple("T5TRTBenchmarkingArgs", ["input_seq_len", "output_seq_len", "input_profile_max_len", "output_profile_max_len"])
-
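A small sketch of how the mixin above turns command-line flags into a `NetworkMetadata` instance; the argument values are illustrative.

```python
import argparse

parser = argparse.ArgumentParser()
T5Metadata.add_args(parser)
args = parser.parse_args(["--variant", "t5-small", "--fp16", "--enable-kv-cache"])

metadata = T5Metadata.from_args(args)
# NetworkMetadata(variant='t5-small', precision=Precision(fp16=True), other=T5Metadata(kv_cache=True))
```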
-class T5ModelTRTConfig(NNConfig):
-
- TARGET_MODELS = ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]
-
- # TensorRT maximum workspace size for each model variant. Set by TensorRT memory_pool_limits API
- MAX_ENCODER_WORKSPACE_MB = {
- TARGET_MODELS[0]: 512,
- TARGET_MODELS[1]: 1024,
- TARGET_MODELS[2]: 2048,
- TARGET_MODELS[3]: 3072,
- TARGET_MODELS[4]: 4096,
- }
-
- MAX_DECODER_WORKSPACE_MB = {
- TARGET_MODELS[0]: 1024,
- TARGET_MODELS[1]: 2048,
- TARGET_MODELS[2]: 3072,
- TARGET_MODELS[3]: 4096,
- TARGET_MODELS[4]: 5120,
- }
-
- MAX_SEQUENCE_LENGTH = {
- TARGET_MODELS[0]: 512,
- TARGET_MODELS[1]: 768,
- TARGET_MODELS[2]: 1024,
- TARGET_MODELS[3]: 1024,
- TARGET_MODELS[4]: 1024,
- }
-
- # To achieve identical results with original HuggingFace implementation, the min_length in model config should be consistent with each model variant
- # see task-specific params in config.json of each variant model
- MIN_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 0,
- TARGET_MODELS[1]: 0,
- TARGET_MODELS[2]: 0,
- TARGET_MODELS[3]: 0,
- TARGET_MODELS[4]: 0,
- }
-
- #TODO: this might better be an inference time input like the `max_length` arg in generate() and greedy_search(). The change needed is in NNDF/interface.py:__call__ so it's a fundamental change affecting GPT2 and T5 code. Here I just put this option in T5 model config for now. But it's also reasonable to treat this as a model config, because the TRT engine building may need this to have fixed dimension (e.g., to enable KV-cache)
- # see task-specific params in config.json of each variant model
- MAX_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 512,
- TARGET_MODELS[1]: 768,
- TARGET_MODELS[2]: 1024,
- TARGET_MODELS[3]: 1024,
- TARGET_MODELS[4]: 1024,
- }
-
- # This parameter should be using HuggingFace config, but this file is locked by test and cannot import transformers, so hardcoded here
- NUM_DECODER_LAYERS = {
- TARGET_MODELS[0]: 6,
- TARGET_MODELS[1]: 12,
- TARGET_MODELS[2]: 24,
- TARGET_MODELS[3]: 24,
- TARGET_MODELS[4]: 24,
- }
- NETWORK_FULL_NAME = "full"
- NETWORK_DECODER_SEGMENT_NAME = "decoder"
- NETWORK_ENCODER_SEGMENT_NAME = "encoder"
- NETWORK_SEGMENTS = [NETWORK_DECODER_SEGMENT_NAME, NETWORK_ENCODER_SEGMENT_NAME]
-
- def __init__(self):
- precision_fp16 = [False, True]
- kv_caches = [False, True]
-
- variants = []
- for variant, fp16, kv_cache in product(
- T5ModelTRTConfig.TARGET_MODELS, precision_fp16, kv_caches
- ):
- variants.append(
- NetworkMetadata(
- variant=variant,
- precision=Precision(fp16=fp16),
- other=T5Metadata(kv_cache=kv_cache),
- )
- )
-
- super().__init__("T5", variants=variants)
-
- def get_python_requirements(self):
- base_requirements = super().get_python_requirements()
- base_requirements.append("transformers==4.8.0")
- return base_requirements
-
- def get_network_segments(self):
- """
- Returns exportable segments for the given network.
- Used in the case where a single network needs to
- be exported into multiple parts.
- """
- return T5ModelTRTConfig.NETWORK_SEGMENTS
-
- def get_metadata_string(self, metadata: NetworkMetadata) -> str:
- # Remove redundant t5 name
- metadata = metadata._replace(variant=metadata.variant.lstrip("t5-"))
- return super().get_metadata_string(metadata)
-
- @staticmethod
- def get_input_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of input dimensions.
- Keys will be equal to get_model_segments()
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- if metadata.other.kv_cache:
- decoder_inputs_dict = OrderedDict(
- {
- "input_ids": (Dims.BATCH, 1),
- "encoder_hidden_states": (
- Dims.BATCH,
- Dims.create_new_sequence_dim("encoder_hidden_length"),
- "encoder_hidden_size"
- ),
- }
- )
- context_inputs_dict = OrderedDict(
- {"encoder_hidden_states": (
- Dims.BATCH,
- Dims.create_new_sequence_dim("encoder_hidden_length"),
- "encoder_hidden_size"
- ),
- }
- )
- # for KV cache version, we need add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V, cross-attention K, cross-attention V)
- for i in range(T5ModelTRTConfig.NUM_DECODER_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("past_decoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.decoder.key"] = self_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.decoder.value"] = self_attention_past_kv_dims
-
- # encoder-decoder cross-attention KV cache (dim[0] & dim[2] are dynamic, but dim[2] is constant at each decoding timestep)
- cross_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("encoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.encoder.key"] = cross_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.encoder.value"] = cross_attention_past_kv_dims
-
- decoder_inputs = [Dims(context_inputs_dict), Dims(decoder_inputs_dict)]
- else:
- decoder_inputs_dict = OrderedDict(
- {
- "input_ids": (Dims.BATCH, Dims.SEQUENCE),
- "encoder_hidden_states": (
- Dims.BATCH,
- Dims.create_new_sequence_dim("encoder_hidden_length"),
- "encoder_hidden_size"
- ),
- }
- )
- decoder_inputs = Dims(decoder_inputs_dict)
-
- encoder_inputs = Dims(OrderedDict({"input_ids": (Dims.BATCH, Dims.SEQUENCE)}))
-
- return {
- T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_inputs,
- T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME: encoder_inputs,
- }
-
- @staticmethod
- def get_output_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of output dimensions.
- Keys will be equal to get_model_segments()
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- if metadata.other.kv_cache:
- decoder_outputs_dict = OrderedDict(
- {"hidden_states": (Dims.BATCH, 1)}
- )
- context_outputs_dict = OrderedDict({})
- # for KV cache version, we need add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V, cross-attention K, cross-attention V)
- for i in range(T5ModelTRTConfig.NUM_DECODER_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("past_decoder_length"), "embedding_size_per_head")
- decoder_outputs_dict[f"present_key_values.{i}.decoder.key"] = self_attention_present_kv_dims
- decoder_outputs_dict[f"present_key_values.{i}.decoder.value"] = self_attention_present_kv_dims
-
- # encoder-decoder cross-attention KV cache (dim[0] & dim[2] are dynamic, but dim[2] is constant at each decoding timestep)
- cross_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("encoder_length"), "embedding_size_per_head")
- context_outputs_dict[f"present_key_values.{i}.encoder.key"] = cross_attention_present_kv_dims
- context_outputs_dict[f"present_key_values.{i}.encoder.value"] = cross_attention_present_kv_dims
-
- decoder_outputs = [Dims(context_outputs_dict), Dims(decoder_outputs_dict)]
- else:
- decoder_outputs_dict = OrderedDict(
- {"hidden_states": (Dims.BATCH, Dims.SEQUENCE)}
- )
- decoder_outputs = Dims(decoder_outputs_dict)
-
- encoder_outputs = Dims(
- OrderedDict(
- {
- "hidden_states": (
- Dims.BATCH,
- Dims.SEQUENCE,
- "encoder_hidden_size"
- )
- }
- )
- )
-
- return {
- T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_outputs,
- T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME: encoder_outputs,
- }
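
For reference, the per-layer KV cache naming convention built by `get_input_dims()`/`get_output_dims()` above can be illustrated in isolation. The sketch below assumes a hypothetical 2-layer decoder and substitutes plain strings for the `Dims` helpers; it is not part of the deleted demo code.

```python
# Illustration only: assumes a hypothetical 2-layer decoder and plain strings
# instead of the Dims helpers used by get_input_dims()/get_output_dims().
from collections import OrderedDict

num_layers = 2
kv_inputs = OrderedDict()
for i in range(num_layers):
    # Decoder self-attention cache: dim 0 (batch) and dim 2 (past length) are dynamic.
    self_attn_dims = ("batch", "num_heads", "past_decoder_length", "embedding_size_per_head")
    kv_inputs[f"past_key_values.{i}.decoder.key"] = self_attn_dims
    kv_inputs[f"past_key_values.{i}.decoder.value"] = self_attn_dims
    # Encoder-decoder cross-attention cache: dim 2 is the (constant) encoder length.
    cross_attn_dims = ("batch", "num_heads", "encoder_length", "embedding_size_per_head")
    kv_inputs[f"past_key_values.{i}.encoder.key"] = cross_attn_dims
    kv_inputs[f"past_key_values.{i}.encoder.value"] = cross_attn_dims

for name, dims in kv_inputs.items():
    print(name, dims)
```
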
diff --git a/demo/HuggingFace/T5/checkpoint.toml b/demo/HuggingFace/T5/checkpoint.toml
deleted file mode 100644
index 4dd7b134..00000000
--- a/demo/HuggingFace/T5/checkpoint.toml
+++ /dev/null
@@ -1,47 +0,0 @@
-# Default requirements
-[T5.all.default.all.premise_a]
-
-input = '''
-premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.
-'''
-
-label = "entailment"
-
-
-[T5.all.default.all.translate_a]
-
-input = '''
-translate English to German: That is good.
-'''
-
-label = "Das ist gut."
-
-[T5.all.default.all.cola_a]
-
-input = '''
-cola sentence: All your base are belong to us.
-'''
-
-label = "unacceptable"
-
-[T5.all.default.all.premise_b]
-
-input = '''
-premise: If I fall asleep then I am going to wake up in 8 hours. hypothesis: I fell asleep but did not wake up in 8 hours.
-'''
-
-label = "contradiction"
-
-# t5-small gets some results differently
-[T5.all.t5-small.all.premise_a]
-
-label = "contradiction"
-
-[T5.all.t5-small.all.cola_a]
-
-label = "acceptable"
-
-# t5-base also gets results differently
-[T5.all.t5-base.all.translate_a]
-
-label = "Das ist gut so."
diff --git a/demo/HuggingFace/T5/export.py b/demo/HuggingFace/T5/export.py
deleted file mode 100644
index 63b7a73e..00000000
--- a/demo/HuggingFace/T5/export.py
+++ /dev/null
@@ -1,483 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Contains logic that captures T5 HuggingFace models into ONNX models.
-Inspired by https://github.com/onnx/models/blob/master/text/machine_comprehension/t5/dependencies/T5-export.py
-"""
-
-from typing import List
-
-from json import encoder
-import os
-from collections import OrderedDict
-
-# tensorrt
-import tensorrt as trt
-from tensorrt import PreviewFeature
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# torch
-import torch
-from torch.nn import Module
-
-# huggingface
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import Seq2SeqLMOutput
-from transformers import T5ForConditionalGeneration
-
-# TRT-HuggingFace
-from T5.T5ModelConfig import T5ModelTRTConfig
-from NNDF.tensorrt_utils import OnnxProcessOperation, process_onnx
-from NNDF.networks import NetworkMetadata, Precision, Dims
-from NNDF.logger import G_LOGGER
-from NNDF.models import (
- TRTEngineFile,
- TorchModelFile,
- ONNXModelFile,
- ModelFileConverter,
-)
-
-def add_extra_fp32(network_definition):
- """
- Force operations involved in layer norm to run in FP32 precision.
- """
- pow_ops = {}
- for layer_index, layer in enumerate(network_definition[1]):
- if layer.type == trt.LayerType.IDENTITY:
- all_fp32 = all([layer.output_type_is_set(o) and layer.get_output_type(o) == trt.float32 for o in range(layer.num_outputs)])
- if all_fp32:
- if layer.get_input(0).dtype == trt.float32:
- layer.precision = trt.float32
-
- if layer.type == trt.LayerType.ELEMENTWISE:
- layer.__class__ = getattr(trt, "IElementWiseLayer")
- if layer.op == trt.ElementWiseOperation.POW:
- pow_ops[layer] = layer_index
- layer.precision = trt.float32
- layer.set_output_type(0, trt.float32)
-
- for _, index in pow_ops.items():
- # Iterate from few layers before pow to include residual add and cast op.
- # Iterate till 10 layers after pow op to include all operations included in layer norm.
- START_OFFSET = 4
- END_OFFSET = 12
- for i in range(index-START_OFFSET, index+END_OFFSET):
- l = network_definition[1].get_layer(i)
- if l.type == trt.LayerType.REDUCE:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.SUM:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.UNARY:
- l.__class__ = getattr(trt, "IUnaryLayer")
- if l.op == trt.UnaryOperation.SQRT:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.DIV:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.PROD:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- return network_definition
-
-# Torch File Encoding #
-class T5DecoderTorchFile(TorchModelFile):
- class TorchModule(Module, GenerationMixin):
- """
-        A simplified definition of the T5 decoder without support for loss.
- Decoder with lm-head attached.
- """
-
- def __init__(self, decoder, lm_head, config, is_trt = False):
- super().__init__()
- self.decoder = decoder
- self.lm_head = lm_head
- self.config = config
-            # HuggingFace's beam search requires self.device to be set. Set it to avoid an application crash.
- self.device = torch.device('cuda')
- # Use hardcoded value to extend compatibility with older HF versions.
- self.main_input_name = "input_ids"
-            # TRT uses cached, precomputed cross-attention, whereas the framework outputs the entire KV cache, so the two need to be treated differently.
- self.is_trt = is_trt
-
- def prepare_inputs_for_generation(
- self,
- input_ids,
- past=None,
- use_cache=None,
- **kwargs
- ):
- # cut decoder_input_ids if past is used
- if past is not None:
- input_ids = input_ids[:, -1:]
-
- return {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_outputs"].last_hidden_state,
- "use_cache": use_cache,
- "past_key_values": past
- }
-
- def forward(
- self,
- input_ids,
- encoder_hidden_states,
- use_cache = None,
- past_key_values = None,
- return_dict = None,
- **kwargs,
- ):
- # self.decoder is the HuggingFace t5 decoder
- decoder_outputs = self.decoder(
- input_ids=input_ids,
- encoder_hidden_states=encoder_hidden_states,
- use_cache=use_cache,
- past_key_values=past_key_values,
- return_dict=return_dict,
- **kwargs
- )
-
- # self.config.d_model ** -0.5 for rescaling output on vocab.
- # as seen in https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5ForConditionalGeneration
- sequence_output = decoder_outputs[0] * self.config.d_model ** -0.5
- logits = self.lm_head(sequence_output)
- if use_cache:
- if self.is_trt:
- past_key_values = ()
- past_key_values_output = decoder_outputs[1]
- for layer_past_states in past_key_values_output:
- past_key_values = past_key_values + (layer_past_states[:2],)
- else:
- past_key_values = decoder_outputs[1]
-
- if not return_dict:
- return (logits, past_key_values)
-
- return Seq2SeqLMOutput(
- logits=logits,
- past_key_values=past_key_values
- )
-
- def __init__(self, model, network_metadata):
- super().__init__(model, T5DecoderConverter, network_metadata)
-
-class T5DecoderCrossAttentionKVGenerator(Module):
- def __init__(self, decoder, device = "cpu"):
- super().__init__()
- self.decoder = decoder
- self.device = device
-
- def forward(self, encoder_hidden_states):
- '''
-        Use the same (but simplified) logic as HF modeling_t5.py to generate the cross-attention KV cache from the provided encoder_hidden_states
- '''
- present_key_values = ()
- for layer_module in self.decoder.block:
-            # hidden_states and position_bias are required for the forward call but irrelevant to the cross-attention KV cache calculation, so generate dummy variables
- dummy_hidden_states = torch.zeros(1,1).to(self.device)
- dummy_position_bias = torch.zeros(1, layer_module.layer[1].EncDecAttention.n_heads, 1, encoder_hidden_states.shape[1]).to(self.device)
- cross_attention_outputs = layer_module.layer[1](
- hidden_states=dummy_hidden_states,
- key_value_states=encoder_hidden_states,
- use_cache=True,
- past_key_value=None,
- position_bias=dummy_position_bias
- )
- present_key_values = present_key_values + cross_attention_outputs[1]
-
- return present_key_values
-
- def __call__(self, *args, **kwargs):
- return self.forward(*args, **kwargs)
-
-class T5EncoderTorchFile(TorchModelFile):
- """Creation of a class to output only the last hidden state from the encoder."""
-
- class TorchModule(Module, GenerationMixin):
- def __init__(self, encoder):
- super().__init__()
- self.encoder = encoder
- # Use hardcoded value to extend compatibility with older HF versions.
- self.main_input_name = "input_ids"
-
- def forward(self, *input, **kwargs):
- return self.encoder(*input, **kwargs)[0]
-
- def __call__(self, *args, **kwargs):
- return self.forward(*args, **kwargs)
-
- def __init__(self, model, network_metadata):
- super().__init__(model, T5EncoderConverter, network_metadata)
-
-
-# ONNX File Encoding #
-class T5EncoderONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, T5EncoderConverter, network_metadata)
-
-
-class T5DecoderONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, T5DecoderConverter, network_metadata)
-
-
-# TRT Engine File Encoding #
-class T5DecoderTRTEngine(TRTEngineFile):
-
- def __init__(self, model, network_metadata):
- super().__init__(model, T5DecoderConverter, network_metadata)
- self.max_trt_workspace = T5ModelTRTConfig.MAX_DECODER_WORKSPACE_MB[network_metadata.variant]
-
-
- def get_network_definition(self, network_definition):
- if self.network_metadata.precision.fp16:
- for i in range(network_definition[1].num_inputs):
- t = network_definition[1].get_input(i)
- if t.dtype == trt.float32:
- t.dtype = trt.float16
-
- for i in range(network_definition[1].num_outputs):
- t = network_definition[1].get_output(i)
- if t.dtype == trt.float32:
- t.dtype = trt.float16
-
- return add_extra_fp32(network_definition)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
-
-class T5EncoderTRTEngine(TRTEngineFile):
-
- def __init__(self, model, network_metadata):
- super().__init__(model, T5EncoderConverter, network_metadata)
- self.max_trt_workspace = T5ModelTRTConfig.MAX_ENCODER_WORKSPACE_MB[network_metadata.variant]
-
- def get_network_definition(self, network_definition):
- return add_extra_fp32(network_definition)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
-# Converters #
-class T5DecoderConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(T5DecoderTorchFile, T5DecoderONNXFile, T5DecoderTRTEngine)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a given huggingface T5 to decoder architecture only.
- Inspired by https://github.com/onnx/models/blob/master/text/machine_comprehension/t5/dependencies/T5-export.py
-
- Args:
-            output_fpath (str): Path to the ONNX file
- model (torch.Model): Model loaded torch class
-
- Returns:
- T5DecoderONNXFile: ONNX decoder object.
- """
- # TODO: CPU and GPU PyTorch models may use different operations and might perform differently.
- # Adding a device parameter to the class may help
- device = model.device
- input_ids = torch.tensor([[42] * 10]).to(device)
- # Exporting the decoder requires a basic instance of the encoder
- # Create one temporarily
- simplified_encoder = T5EncoderTorchFile.TorchModule(model.encoder)
- # Exports to ONNX
- decoder_with_lm_head = T5DecoderTorchFile.TorchModule(
- model.decoder, model.lm_head, model.config, is_trt = True
- )
-
- inputs = T5ModelTRTConfig.get_input_dims(network_metadata)["decoder"]
- outputs = T5ModelTRTConfig.get_output_dims(network_metadata)["decoder"]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
-
- if not network_metadata.other.kv_cache:
-            # This code allows the HuggingFace-compatible torch class to be used with the ONNX exporter
- old_forward = decoder_with_lm_head.forward
- def _export_forward(input_ids, encoder_hidden_states, **kwargs):
- result = old_forward(input_ids, encoder_hidden_states, use_cache=False, **kwargs)
- return result[0]
- decoder_with_lm_head.forward = _export_forward
-
- torch.onnx.export(
- decoder_with_lm_head,
- (input_ids, simplified_encoder(input_ids)),
- output_fpath,
- do_constant_folding=True,
- opset_version=13,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
- else:
- encoder_hidden_states = simplified_encoder(input_ids).to(device)
- kv_decoder_input_ids = input_ids[:,-1:].to(device)
- decoder_output = decoder_with_lm_head.decoder(input_ids=kv_decoder_input_ids, encoder_hidden_states=encoder_hidden_states, use_cache=True, past_key_values=None) # decoder output at t-1 step (logits, past_key_values from 0 to t-1)
- past_key_values = decoder_output[1]
-            # This code allows the HuggingFace-compatible torch class to be used with the ONNX exporter (change made just before onnx.export)
- old_forward = decoder_with_lm_head.forward
- def _export_forward(input_ids, encoder_hidden_states, past_key_values):
- result = old_forward(input_ids, encoder_hidden_states, past_key_values=past_key_values, use_cache=True)
- return result
- decoder_with_lm_head.forward = _export_forward
-
- torch.onnx.export(
- decoder_with_lm_head,
- (kv_decoder_input_ids, encoder_hidden_states, past_key_values),
- output_fpath,
- do_constant_folding=True,
- opset_version=13,
- input_names=inputs[1].get_names(),
- output_names=outputs[1].get_names(),
- dynamic_axes={
- **inputs[1].get_torch_dynamic_axis_encoding(),
- **outputs[1].get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- cross_attention_kv_generator = T5DecoderCrossAttentionKVGenerator(decoder_with_lm_head.decoder, device)
- decoder_folder, decoder_name = os.path.split(output_fpath)
- decoder_name, decoder_ext = os.path.splitext(decoder_name)
- output_fpath_kv_generator_folder = os.path.join(decoder_folder, "cross_attention_kv_generator")
- os.makedirs(output_fpath_kv_generator_folder, exist_ok = True)
- output_fpath_kv_generator = os.path.join(output_fpath_kv_generator_folder, decoder_name + "-cross_attention_kv_generator" + decoder_ext)
- torch.onnx.export(
- cross_attention_kv_generator,
- (encoder_hidden_states),
- output_fpath_kv_generator,
- do_constant_folding=True,
- opset_version=13,
- input_names=inputs[0].get_names(),
- output_names=outputs[0].get_names(),
- dynamic_axes={
- **inputs[0].get_torch_dynamic_axis_encoding(),
- **outputs[0].get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- if network_metadata.precision.fp16:
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath_kv_generator, output_fpath_kv_generator)
-
- if network_metadata.precision.fp16:
- process_onnx([OnnxProcessOperation.MOVE_CAST_OP, OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath, output_fpath)
-
- return T5DecoderONNXFile(output_fpath, network_metadata)
-
-
-class T5EncoderConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(T5EncoderTorchFile, T5EncoderONNXFile, T5EncoderTRTEngine)
-
- def onnx_to_trt(
- self, output_fpath: str, input_fpath: str, network_metadata: NetworkMetadata, profiles: List[Profile], preview_features: List[PreviewFeature]
- ):
- """
- Override onnx_to_trt function from base.
-        Workaround: models larger than t5-small are too large and cause FP16 to overflow, so the encoder should not use FP16 tactics even in FP16 mode.
-        End-to-end perf decreases by less than 10%, and the speedup with TRT over frameworks remains substantial.
- """
- # Force encoder to FP32 only if variants are anything larger than small
- # because of overflow and underflow issues
- if network_metadata.precision.fp16 and network_metadata.variant != "t5-small":
- network_metadata_cp_dct = network_metadata._asdict()
- del network_metadata_cp_dct["precision"]
- network_metadata = NetworkMetadata(**network_metadata_cp_dct, precision=Precision(fp16=False))
-
- return super().onnx_to_trt(output_fpath, input_fpath, network_metadata, profiles, preview_features)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a given huggingface T5 to encoder architecture only.
- Inspired by https://github.com/onnx/models/blob/master/text/machine_comprehension/t5/dependencies/T5-export.py
-
- Args:
-            output_fpath (str): Path to the ONNX file
- model (torch.Model): Model loaded torch class
-
- Returns:
-            T5EncoderONNXFile: ONNX encoder object.
- """
- device = model.device
- input_ids = torch.tensor([[42] * 10]).to(device)
- simplified_encoder = T5EncoderTorchFile.TorchModule(model.encoder)
- inputs = T5ModelTRTConfig.get_input_dims(network_metadata)["encoder"]
- outputs = T5ModelTRTConfig.get_output_dims(network_metadata)["encoder"]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
- torch.onnx.export(
- simplified_encoder,
- input_ids,
- output_fpath,
- do_constant_folding=True,
- opset_version=13,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- if network_metadata.precision.fp16:
- process_onnx([OnnxProcessOperation.MOVE_CAST_OP, OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath, output_fpath)
-
- return T5EncoderONNXFile(output_fpath, network_metadata)
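
The exporters above share a common pattern: gate `use_external_data_format` on the installed torch version and declare dynamic axes when calling `torch.onnx.export`. Below is a minimal, self-contained sketch of that pattern; the toy `Linear` model, output file name, and tensor names are illustrative stand-ins, not part of the demo.

```python
# Minimal sketch (assumed toy model and names): version-gated use_external_data_format
# plus dynamic axes, mirroring the export pattern used above.
import torch
from torch.nn import Linear

model = Linear(16, 4).eval()       # stand-in for the real T5 wrapper modules
dummy_input = torch.randn(2, 16)   # (batch, features)

opt_args = {}
version_major, version_minor = (int(v) for v in torch.__version__.split(".")[:2])
if version_major < 1 or (version_major == 1 and version_minor < 11):
    # Older torch versions need this flag set explicitly for large models.
    opt_args["use_external_data_format"] = True

torch.onnx.export(
    model,
    (dummy_input,),
    "toy_model.onnx",
    do_constant_folding=True,
    opset_version=13,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    training=torch.onnx.TrainingMode.EVAL,
    **opt_args,
)
```
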
diff --git a/demo/HuggingFace/T5/frameworks.py b/demo/HuggingFace/T5/frameworks.py
deleted file mode 100644
index 2f06128d..00000000
--- a/demo/HuggingFace/T5/frameworks.py
+++ /dev/null
@@ -1,340 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-
-from typing import List, Union
-
-# huggingface
-from transformers import (
- T5ForConditionalGeneration,
- T5Tokenizer,
- T5Config,
-)
-
-# torch
-import torch
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# TRT-HuggingFace
-from NNDF.interface import FrameworkCommand
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkResult,
- NetworkMetadata,
- NetworkRuntime,
- NetworkModels,
- NetworkModel,
- TimingProfile,
-)
-from T5.export import T5EncoderTorchFile, T5DecoderTorchFile
-from T5.T5ModelConfig import T5ModelTRTConfig, T5BenchmarkingArgs
-from T5.measurements import decoder_inference, encoder_inference, full_inference, calculate_perplexity
-from NNDF.general_utils import confirm_folder_delete, NNFolderWorkspace
-
-
-class T5FHuggingFace(FrameworkCommand):
- def __init__(self):
- super().__init__(
- T5ModelTRTConfig, description="Runs framework results for T5 model."
- )
-
- self.onnx_t5_encoder = None
- self.onnx_t5_decoder = None
- self.torch_t5_dir = None
-
- def generate_and_download_framework(
- self, metadata: NetworkMetadata, workspace: NNFolderWorkspace
- ) -> NetworkModels:
-
- trt_t5_config = self.config
- metadata_serialized = trt_t5_config.get_metadata_string(metadata)
- workspace_dir, encoder_onnx_root, decoder_onnx_root = workspace.set_model_path(metadata_serialized, is_encoder_decoder = True)
- pytorch_model_dir = os.path.join(workspace_dir, "pytorch_model")
- # We keep track of the generated torch location for cleanup later
- self.torch_t5_dir = pytorch_model_dir
-
- model = None
- if not os.path.exists(pytorch_model_dir):
- # Generate the pre-trained weights
- model = T5ForConditionalGeneration.from_pretrained(
- metadata.variant, use_cache = metadata.other.kv_cache
- )
- model.save_pretrained(pytorch_model_dir)
- print("Pytorch Model saved to {}".format(pytorch_model_dir))
- else:
- print(
- "Frameworks file already exists, skipping generation and loading from file instead."
- )
- model = T5ForConditionalGeneration.from_pretrained(
- pytorch_model_dir,
- use_cache = metadata.other.kv_cache
- )
-
- # These ONNX models can be converted using special encoder and decoder classes.
- encoder_onnx_model_fpath = os.path.join(encoder_onnx_root, metadata_serialized + "-encoder.onnx")
- decoder_onnx_model_fpath = os.path.join(decoder_onnx_root, metadata_serialized + "-decoder-with-lm-head.onnx")
-
- t5_encoder = T5EncoderTorchFile(model, metadata)
- t5_decoder = T5DecoderTorchFile(model, metadata)
- self.onnx_t5_encoder = t5_encoder.as_onnx_model(
- encoder_onnx_model_fpath, force_overwrite=False
- )
- self.onnx_t5_decoder = t5_decoder.as_onnx_model(
- decoder_onnx_model_fpath, force_overwrite=False
- )
-
- onnx_models = [
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.onnx_t5_decoder.fpath,
- ),
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=self.onnx_t5_encoder.fpath,
- ),
- ]
- torch_models = [
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_FULL_NAME, fpath=pytorch_model_dir
- )
- ]
-
- return NetworkModels(torch=torch_models, onnx=onnx_models, trt=None)
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_onnx_model: bool = True,
- keep_pytorch_model: bool = True,
- ) -> None:
- """
- Cleans up the working directory and leaves models if available.
-        Should not assume any functions from the framework class have been called.
- Return:
- None
- """
- # Clean-up generated files
- if not keep_onnx_model:
- if self.onnx_t5_decoder is not None:
- self.onnx_t5_decoder.cleanup()
- if self.onnx_t5_encoder is not None:
- self.onnx_t5_encoder.cleanup()
-
- if not keep_pytorch_model:
- # Using rmtree can be dangerous, have user confirm before deleting.
- confirm_folder_delete(
- self.torch_t5_dir,
- prompt="Confirm you want to delete downloaded pytorch model folder?",
- )
-
- if not keep_pytorch_model and not keep_onnx_model:
- workspace.cleanup(force_remove=False)
-
- def setup_tokenizer_and_model(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- ):
- tokenizer = T5Tokenizer.from_pretrained(metadata.variant)
-
-        # By default, the HuggingFace model structure is one giant file.
- t5_torch_fpath = network_fpaths.torch[0].fpath
- t5_model = T5ForConditionalGeneration.from_pretrained(t5_torch_fpath, use_cache=metadata.other.kv_cache)
- if metadata.precision.fp16:
- t5_model = t5_model.cuda().half()
-
- t5_torch_encoder = T5EncoderTorchFile.TorchModule(t5_model.encoder)
- t5_torch_decoder = T5DecoderTorchFile.TorchModule(
- t5_model.decoder, t5_model.lm_head, t5_model.config
- )
-
- return tokenizer, t5_torch_encoder, t5_torch_decoder
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- inference_input: str,
- timing_profile: TimingProfile,
- use_cpu: bool,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: T5BenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer, t5_torch_encoder, t5_torch_decoder = self.setup_tokenizer_and_model(metadata, network_fpaths)
- hf_config = T5Config.from_pretrained(metadata.variant, use_cache = metadata.other.kv_cache)
-        # Prepare the input tokens and find out the output sequence length.
- if not benchmarking_mode:
- output_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- max_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_seq_len = benchmarking_args.input_seq_len if benchmarking_args.input_seq_len > 0 else max_seq_len
- output_seq_len = benchmarking_args.output_seq_len if benchmarking_args.output_seq_len > 0 else max_seq_len
- input_ids = torch.randint(0, hf_config.vocab_size, (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- t5_torch_encoder, input_ids, timing_profile, use_cuda=(not use_cpu)
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
-
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- t5_torch_decoder,
- expand_inputs_for_beam_search(decoder_input_ids, num_beams) if num_beams > 1 else decoder_input_ids,
- expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state,
- timing_profile,
- use_cache=metadata.other.kv_cache,
- )
-
- decoder_output, full_e2e_runtime = full_inference(
- t5_torch_encoder,
- t5_torch_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams=num_beams,
- max_length=output_seq_len,
- min_length=T5ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=(not use_cpu),
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
-
- # Prepare runtime results.
- runtime=[
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=network_fpaths)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=network_fpaths,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- encoder_input: str,
- decoder_input: str,
- ):
- tokenizer, t5_torch_encoder, t5_torch_decoder = self.setup_tokenizer_and_model(metadata, network_fpaths)
- encoder_input_ids = tokenizer([encoder_input], padding=True, return_tensors="pt").input_ids
- decoder_input_ids = tokenizer([decoder_input], padding=True, return_tensors="pt").input_ids
- perplexity = calculate_perplexity(
- t5_torch_encoder, t5_torch_decoder, tokenizer, encoder_input_ids, decoder_input_ids,
- T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
- )
- return perplexity
-
- def run_framework(
- self,
- metadata: NetworkMetadata,
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_pytorch_model: bool,
- timing_profile: TimingProfile,
- use_cpu: bool = False,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- """
-        Main entry point of the command, which compiles and generates our model data.
- """
- inference_results = []
- ppl_results = []
- workspace = NNFolderWorkspace(
- self.config.network_name, metadata, working_directory
- )
- try:
- network_fpaths = self.generate_and_download_framework(metadata, workspace)
- if not benchmarking_mode:
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, network_fpaths, ninput, timing_profile, use_cpu, batch_size, args.num_beams
- )
- )
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Encoder and decoder inputs must pair up"
- for ei, di in zip(network_input, perplexity_reference):
- ppl_results.append(
- self.execute_calculate_perplexity(
- metadata, network_fpaths, ei, di
- )
- )
- else:
- benchmarking_args = T5BenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- inference_results = self.execute_inference(
- metadata, network_fpaths, None, timing_profile, use_cpu, batch_size, args.num_beams, True, benchmarking_args
- )
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_pytorch_model)
-
- return inference_results, ppl_results
-
-
-# Entry point
-RUN_CMD = T5FHuggingFace()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/T5/measurements.py b/demo/HuggingFace/T5/measurements.py
deleted file mode 100644
index 3b30e8c1..00000000
--- a/demo/HuggingFace/T5/measurements.py
+++ /dev/null
@@ -1,136 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Utils specific to T5 network.
-"""
-
-# torch
-import torch
-
-# TRT-HuggingFace
-from NNDF.general_utils import measure_python_inference_code
-from NNDF.torch_utils import use_cuda, expand_inputs_for_beam_search
-from NNDF.tensorrt_utils import TRTNativeRunner
-from NNDF.logger import G_LOGGER
-from transformers.modeling_outputs import BaseModelOutput
-
-@use_cuda
-def decoder_inference(
- t5_decoder, input_ids, encoder_last_hidden_state, timing_profile, use_cuda=True, use_cache=False, past_key_values=None
-):
-    # This implementation is a bit ugly. Moving the model-type check into HFRunner would be cleaner.
- if isinstance(t5_decoder, TRTNativeRunner):
-        # The function technically lives in T5TRTDecoder; however, due to a circular import, a TRTNativeRunner in this module's scope
- # implies the existence of this function.
- t5_decoder.set_return_device("cuda" if use_cuda else "cpu")
-
- def decoder_stmt():
- t5_decoder(
- input_ids=input_ids, encoder_hidden_states=encoder_last_hidden_state, use_cache=use_cache,
- past_key_values=past_key_values
- )
-
- decoder_e2e_time = measure_python_inference_code(decoder_stmt, timing_profile)
-
- return (decoder_stmt(), decoder_e2e_time)
-
-
-@use_cuda
-def encoder_inference(t5_encoder, input_ids, timing_profile, use_cuda=True):
- encoder_stmt = lambda: t5_encoder(input_ids=input_ids)
- encoder_e2e_time = measure_python_inference_code(encoder_stmt, timing_profile)
-
- return (encoder_stmt(), encoder_e2e_time)
-
-@use_cuda
-def full_inference(
- t5_encoder,
- t5_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length,
- min_length=0,
- num_beams=1,
- batch_size=1,
- use_cuda=True,
- early_stopping=True,
- use_cache=False
-):
-
- G_LOGGER.info(f"Running full inference...")
- encoder_last_hidden_state = t5_encoder(input_ids=input_ids)
-
- def _e2e():
- with torch.no_grad():
- decoder_output = t5_decoder.generate(
- input_ids,
- max_length = max_length,
- min_length = min_length,
- num_beams = num_beams,
- early_stopping = early_stopping,
- eos_token_id = t5_decoder.config.eos_token_id,
- pad_token_id = t5_decoder.config.pad_token_id,
- use_cache = use_cache,
- encoder_outputs = BaseModelOutput(last_hidden_state = encoder_last_hidden_state),
- )
- return decoder_output
-
- if isinstance(t5_decoder, TRTNativeRunner):
- t5_decoder.set_return_device("cuda" if use_cuda else "cpu")
-
- measurement_function = _e2e
-
- full_e2e_time = measure_python_inference_code(measurement_function, timing_profile)
-
- return (measurement_function(), full_e2e_time)
-
-@use_cuda
-def calculate_perplexity(
- t5_encoder,
- t5_decoder,
- tokenizer,
- input_ids,
- decoder_input_ids,
- max_seq_len=None,
- use_cuda=True,
-):
- encoder_last_hidden_state = t5_encoder(input_ids=input_ids)
- if isinstance(t5_decoder, TRTNativeRunner):
- t5_decoder.set_return_device("cuda" if use_cuda else "cpu")
-
- # Set the first token to be pad token
- decoder_input_ids_padded = torch.full(
- decoder_input_ids.size()[:-1] + (decoder_input_ids.size()[-1] + 1,),
- tokenizer.convert_tokens_to_ids(tokenizer.pad_token),
- dtype=decoder_input_ids.dtype,
- )
- decoder_input_ids_padded[..., 1:] = decoder_input_ids
-
- if use_cuda:
- encoder_last_hidden_state = encoder_last_hidden_state.to("cuda")
- decoder_input_ids_padded = decoder_input_ids_padded.to("cuda")
-
- with torch.no_grad():
- if max_seq_len is not None:
- decoder_input_ids_padded = decoder_input_ids_padded[:, :max_seq_len]
- logits = t5_decoder(decoder_input_ids_padded, encoder_last_hidden_state, return_dict=True).logits
- # Truncate the last prediction
- logits = logits[:, :-1, :]
- loss = torch.nn.CrossEntropyLoss()(logits.permute((0, 2, 1)), decoder_input_ids)
- return torch.exp(loss).item()
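
The core of `calculate_perplexity` above is a standard next-token cross-entropy followed by an exponential. The standalone sketch below uses random logits and made-up shapes purely to show that arithmetic.

```python
# Standalone sketch (made-up shapes): drop the last prediction, compute cross-entropy
# against the reference ids, then exponentiate the mean loss to get perplexity.
import torch

vocab_size = 32
decoder_input_ids = torch.randint(0, vocab_size, (1, 7))                  # reference sequence
logits = torch.randn(1, decoder_input_ids.shape[1] + 1, vocab_size)       # one extra step from the pad-prefixed input

logits = logits[:, :-1, :]                                                 # truncate the last prediction
loss = torch.nn.CrossEntropyLoss()(logits.permute(0, 2, 1), decoder_input_ids)
perplexity = torch.exp(loss).item()
print(perplexity)
```
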
diff --git a/demo/HuggingFace/T5/onnxrt.py b/demo/HuggingFace/T5/onnxrt.py
deleted file mode 100644
index 499ba2a5..00000000
--- a/demo/HuggingFace/T5/onnxrt.py
+++ /dev/null
@@ -1,342 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Executes ONNX Runtime framework code. See README.md for more information.
-"""
-
-import os
-import sys
-from typing import Dict, List, Tuple
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# huggingface
-from transformers import T5Tokenizer, T5Config, PretrainedConfig
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import Seq2SeqLMOutput
-
-# torch
-import torch
-
-# TRT-HuggingFace
-from NNDF.interface import OnnxRTCommand
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.general_utils import NNFolderWorkspace
-from NNDF.tensorrt_utils import PolygraphyOnnxRunner
-from T5.frameworks import T5FHuggingFace
-from T5.T5ModelConfig import T5ModelTRTConfig, T5BenchmarkingArgs
-from T5.measurements import decoder_inference, encoder_inference, full_inference
-from NNDF.logger import G_LOGGER
-
-class OnnxHFRunner(PolygraphyOnnxRunner, GenerationMixin):
- """Runner that adds interop support for HF and HF provided greedy_search functions."""
-
- def __init__(self, engine_fpath: str, network_metadata: NetworkMetadata, hf_config: PretrainedConfig):
- super().__init__(engine_fpath, network_metadata)
- # required for greedy search used by generation mixin
- self.main_input_name = "input_ids"
- self.config = hf_config
-
-class T5OnnxEncoder(OnnxHFRunner):
- """OnnxRT implemented network interface that is mainly to check correctness."""
-
- def forward(self, input_ids, *args, **kwargs):
- # Unoptimized unconditional transfer to numpy for interfacing with polygraphy
- input_ids = input_ids.cpu().numpy().astype("int64")
- return torch.from_numpy(self.trt_context.infer({"input_ids": input_ids})["hidden_states"])
-
-class T5OnnxDecoder(OnnxHFRunner):
- def prepare_inputs_for_generation(self, input_ids, **kwargs):
- return {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_outputs"].last_hidden_state,
- }
-
- def forward(self, input_ids, encoder_hidden_states, *args, **kwargs):
- # Unoptimized unconditional transfer to numpy for interfacing with polygraphy
- input_ids = input_ids.cpu().numpy().astype("int64")
- data_type = "float32"
- encoder_hidden_states = encoder_hidden_states.cpu().numpy().astype(data_type)
-
- logits = self.trt_context.infer(
- {"input_ids": input_ids, "encoder_hidden_states": encoder_hidden_states}
- )["hidden_states"]
-
- return Seq2SeqLMOutput(logits=torch.from_numpy(logits))
-
-class T5ONNXRT(OnnxRTCommand):
- def __init__(self):
- super().__init__(
- T5ModelTRTConfig,
- "Runs polygraphy results for T5 model.",
- T5FHuggingFace,
- )
- self.t5_ort_decoder = None
- self.t5_ort_encoder = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.t5_ort_encoder:
- self.t5_ort_encoder.release()
- if self.t5_ort_decoder:
- self.t5_ort_decoder.release()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: T5BenchmarkingArgs = None,
- ) -> NetworkResult:
-
- hf_config = T5Config.from_pretrained(metadata.variant)
- tokenizer = T5Tokenizer.from_pretrained(metadata.variant)
-        # Prepare the input tokens and find out the output sequence length.
- if not benchmarking_mode:
- output_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- max_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_seq_len = benchmarking_args.input_seq_len if benchmarking_args.input_seq_len > 0 else max_seq_len
- output_seq_len = benchmarking_args.output_seq_len if benchmarking_args.output_seq_len > 0 else max_seq_len
- input_ids = torch.randint(0, hf_config.vocab_size, (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- self.t5_ort_encoder, input_ids, timing_profile
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2
-
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
- # OnnxRT currently does not enable kv cache
- _, decoder_e2e_time = decoder_inference(
- self.t5_ort_decoder,
- expand_inputs_for_beam_search(decoder_input_ids, num_beams) if num_beams > 1 else decoder_input_ids,
- expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state,
- timing_profile,
- use_cache=metadata.other.kv_cache,
- )
-
- decoder_output, full_e2e_runtime = full_inference(
- self.t5_ort_encoder,
- self.t5_ort_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=T5ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=False,
- num_beams=num_beams,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
-
- # Prepare runtime results.
- runtime = [
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models=NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=None
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def run_onnxrt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- ) -> List[NetworkResult]:
- workspace = NNFolderWorkspace(
- self.frameworks_cmd.config.network_name, metadata, working_directory
- )
-
- results = []
- try:
- if metadata.other.kv_cache:
- assert False, "OnnxRT currently does not support kv cache."
- # no fpath provided for onnx files, download them
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self.frameworks_cmd.generate_and_download_framework(
- metadata, workspace
- ).onnx
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
-            # Output networks shall not exceed the number of network segments explicitly defined by the configuration file.
- assert len(onnx_fpaths) == len(
- T5ModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in T5 model.".format(
- len(T5ModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- lookup_onnx_table = {v.name: v for v in onnx_fpaths}
-
- hf_config = T5Config.from_pretrained(
- metadata.variant,
- use_cache=metadata.other.kv_cache
- )
- self.t5_ort_encoder = T5OnnxEncoder(
- lookup_onnx_table["encoder"].fpath, metadata, hf_config
- )
- self.t5_ort_decoder = T5OnnxDecoder(
- lookup_onnx_table["decoder"].fpath, metadata, hf_config
- )
-
- if not benchmarking_mode:
- for ninput in network_input:
- results.append(
- self.execute_inference(
- metadata, lookup_onnx_table, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- else:
- benchmarking_args = T5BenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- results = self.execute_inference(
- metadata, lookup_onnx_table, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_torch_model)
- # TODO: Add perplexity calculation for OnnxRT
- G_LOGGER.warning("perplexity calculation is disabled for OnnxRT.")
- return results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
- onnx_group = parser.add_argument_group("onnx models")
- onnx_group.add_argument(
- "--onnx-decoder-fpath",
- default=None,
- help="Path to ONNX decoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
- onnx_group.add_argument(
- "--onnx-encoder-fpath",
- default=None,
- help="Path to ONNX encoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
- # Check if both flags are given otherwise error out
- decoder_fpath_check = args.onnx_decoder_fpath is None
- encoder_fpath_check = args.onnx_encoder_fpath is None
-
- network_models = None
- if decoder_fpath_check and encoder_fpath_check:
- network_models = tuple()
- elif decoder_fpath_check or encoder_fpath_check:
- raise self._parser.error(
- "Both --onnx-decoder-fpath and --onnx-encoder-fpath must be given. Otherwise neither should be provided for script to download them."
- )
- else:
- onnx_decoder = NetworkModel(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_decoder_fpath,
- )
- onnx_encoder = NetworkModel(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=args.onnx_encoder_fpath,
- )
- network_models = (onnx_decoder, onnx_encoder)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- """Override args to metadata to use export subroutine."""
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = T5ONNXRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/T5/trt.py b/demo/HuggingFace/T5/trt.py
deleted file mode 100644
index 3a2decc2..00000000
--- a/demo/HuggingFace/T5/trt.py
+++ /dev/null
@@ -1,953 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-import copy
-from typing import Dict, List, Tuple, Union
-from functools import reduce
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# tensorrt
-import tensorrt as trt
-
-# torch
-import torch
-
-# huggingface
-from transformers import T5Tokenizer, T5Config
-from transformers.modeling_outputs import Seq2SeqLMOutput
-from transformers.configuration_utils import PretrainedConfig
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import BaseModelOutput
-
-# tensorrt
-from tensorrt import PreviewFeature
-
-# TRT-HuggingFace
-from NNDF.interface import TRTInferenceCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.tensorrt_utils import TRTNativeRunner, set_kv_data, allocate_binding_buffer, setup_benchmark_arg
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from NNDF.general_utils import NNFolderWorkspace
-from T5.frameworks import T5FHuggingFace
-from T5.T5ModelConfig import T5ModelTRTConfig, T5TRTBenchmarkingArgs
-from T5.measurements import decoder_inference, encoder_inference, full_inference, calculate_perplexity
-from T5.export import T5DecoderONNXFile, T5EncoderONNXFile, T5DecoderTRTEngine, T5EncoderTRTEngine
-from NNDF.models import TRTEngineFile
-from NNDF.logger import G_LOGGER
-
-
-class TRTHFRunner(TRTNativeRunner, GenerationMixin):
- """Runner that adds interop support for HF and HF provided greedy_search functions."""
-
- # Stores the encoder input length received at runtime, which is used to slice decoder inputs.
- ENCODER_LENGTH = 0
- def _allocate_memory(self,
- input_shapes: Dict[str, tuple],
- input_types: Dict[str, torch.dtype],
- output_shapes: Dict[str, tuple],
- output_types: Dict[str, torch.dtype]):
- """Helper function for binding several inputs at once and pre-allocating the results."""
- # Allocate memories as 1D linear buffers for simpler handling of dynamic shapes.
- self.inputs = allocate_binding_buffer(input_types, input_shapes)
- self.outputs = allocate_binding_buffer(output_types, output_shapes)
-
- bindings = [None] * self.trt_engine.num_bindings
-
- for input_name, input_array in self.inputs.items():
- # Allocate memory for inputs
- input_idx = self.trt_engine.get_binding_index(input_name)
- self.trt_context.set_binding_shape(input_idx, input_shapes[input_name])
- bindings[input_idx] = input_array.data_ptr()
-
- assert self.trt_context.all_binding_shapes_specified
-
- for output_name, output_array in self.outputs.items():
- # Output shape should be allocated from context size
- output_idx = self.trt_engine.get_binding_index(output_name)
- bindings[output_idx] = output_array.data_ptr()
-
- return bindings
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1
- ):
- super().__init__(trt_engine_file, network_metadata)
- self.config = hf_config
- self.batch_size = batch_size
-
-class T5TRTEncoder(TRTHFRunner):
- """TRT implemented network interface that can be used to measure inference time."""
-
- def __init__(
- self,
- trt_engine_file: str,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- benchmarking_args: T5TRTBenchmarkingArgs = None
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
- self.data_type = torch.float32
- # In benchmarking mode, the max_sequence_length should be the designated input_profile_max_len
- if benchmarking_args is not None and benchmarking_args.input_profile_max_len is not None:
- self.max_sequence_length = benchmarking_args.input_profile_max_len
- else:
- self.max_sequence_length = hf_config.d_model
- self.encoder_hidden_size = hf_config.d_model
- self.main_input_name = "input_ids"
- # We only have one profile to select so we can just grab the profile at the start of the class
- self.profile_idx = self.get_optimization_profile(batch_size=self.batch_size, sequence_length=1)
-
- self.input_shapes = {
- "input_ids": (self.batch_size, self.max_sequence_length)
- }
- self.input_types = {
- "input_ids": torch.int32
- }
- self.output_shapes = {
- "hidden_states": (self.batch_size, self.max_sequence_length, self.encoder_hidden_size)
- }
- self.output_types = {
- "hidden_states": self.data_type
- }
-
- self.bindings = self._allocate_memory(self.input_shapes, self.input_types, self.output_shapes, self.output_types)
-
- def forward(self, input_ids, *args, **kwargs):
- bs = self.batch_size
- max_length = self.max_sequence_length
- TRTHFRunner.ENCODER_LENGTH = input_ids.shape[1]
- input_length = input_ids.shape[1]
- encoder_hidden_size = self.encoder_hidden_size
-
-        # Check if the input data is on CPU (which usually means PyTorch does not support the current GPU).
- is_cpu_mode = (input_ids.device == torch.device("cpu"))
-
-        # We allocate the buffers using max_length, but we only need the first portion of it, so copy the data into the
- # first portion of the input buffer.
- # TODO: Could we just reuse input_ids' data_ptr() as the first binding when input_ids is already contiguous to
- # avoid an additional D2D?
- if is_cpu_mode:
- self.inputs["input_ids"] = input_ids.int().flatten().contiguous().cuda()
- self.bindings[0] = self.inputs["input_ids"].data_ptr()
- else:
- self.inputs["input_ids"][:bs * input_length] = input_ids.flatten()
-
- # Set the binding shape of input_ids, which should be (bs, input_length).
- self.trt_context.set_binding_shape(0, input_ids.shape)
-
- # Launch TRT inference.
- # TODO: Could we use execute_v2_async() instead of execute_v2()?
- self.trt_context.execute_v2(bindings=self.bindings)
-
-        # We allocate the buffers using max_length, but we only need the first portion of it, so get only the first
- # portion of the output buffer and return that.
- # TODO: Could we construct a Torch tensor using given data_ptr() to avoid this D2D copy?
- hidden_states_output = self.outputs["hidden_states"]
- if is_cpu_mode:
- hidden_states_output = hidden_states_output.cpu()
-
- folded = hidden_states_output[:bs * input_length * encoder_hidden_size].view(bs, input_length, encoder_hidden_size)
-
- return folded
-
-class T5TRTDecoder(TRTHFRunner):
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_args: T5TRTBenchmarkingArgs = None,
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
- self.data_type = torch.float32 if not network_metadata.precision.fp16 else torch.float16
-
- # In benchmarking mode, the max_sequence_length should be the user-provided input_profile_max_len
- if benchmarking_args is not None and benchmarking_args.input_profile_max_len is not None:
- self.max_input_length = benchmarking_args.input_profile_max_len
- else:
- self.max_input_length = hf_config.d_model
-
- # Similarly, the max_output_length should be the user-provided output_profile_max_len
- if benchmarking_args is not None and benchmarking_args.output_profile_max_len is not None:
- self.max_output_length = benchmarking_args.output_profile_max_len
- else:
- self.max_output_length = hf_config.d_model
-
- self.device = torch.device('cuda')
- self.main_input_name = "input_ids"
- self.encoder_hidden_size = hf_config.d_model
- self.num_heads = hf_config.num_heads
- self.embedding_size_per_head = hf_config.d_kv
- self.num_decoder_layers = hf_config.num_decoder_layers
- self.profile_idx = 0
- self.bindings = [0] * self.trt_engine.num_bindings
-
- hidden_states_profile_length = self.max_output_length if not self.config.use_cache else 1
- # Construct buffer for hidden states outputs
- self.hidden_states = torch.zeros((self.batch_size * num_beams, hidden_states_profile_length, hf_config.vocab_size), dtype = self.data_type).cuda()
- self.bindings[self.trt_engine.get_binding_index("hidden_states")] = self.hidden_states.data_ptr()
-
- if self.config.use_cache:
-
- self.self_attention_cache = {}
- self.cross_attention_cache = {}
-
-            # We are using cached cross-attention and not outputting redundant cross-attention information. We only output the self-attention cache increment.
- self_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_output_length - 1, self.embedding_size_per_head)
- cross_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_input_length, self.embedding_size_per_head)
-
- # Set self attention kv cache shape and type
- for i in range(self.num_decoder_layers):
- for code in ["key", "value"]:
- # Allocate self attention buffer. The buffer is used both as inputs and outputs
- self_attention_name = f"key_values.{i}.decoder.{code}"
- input_buffer = torch.zeros(self_attention_kv_shape, dtype = self.data_type).cuda()
- input_idx = self.trt_engine.get_binding_index("past_" + self_attention_name)
- self.self_attention_cache[self_attention_name] = input_buffer
- self.bindings[input_idx] = input_buffer.data_ptr()
-
- output_idx = self.trt_engine.get_binding_index("present_" + self_attention_name)
- self.bindings[output_idx] = input_buffer.data_ptr()
-
- # Allocate cross attention buffer
- cross_attention_past_name = f"past_key_values.{i}.encoder.{code}"
- cross_attention_buffer = torch.zeros(cross_attention_kv_shape, dtype = self.data_type).cuda()
- cross_attention_idx = self.trt_engine.get_binding_index(cross_attention_past_name)
- self.cross_attention_cache[cross_attention_past_name] = cross_attention_buffer
- self.bindings[cross_attention_idx] = cross_attention_buffer.data_ptr()
-
- self.kv_cache_binding_offset = 2 # 0: input_ids, 1: encoder_hidden_states, kv cache input indices start from 2
- self.past_decoder_length = 0
-
- # Optimization bit
- self.persist_encoder_hidden_states = False
- self.encoder_hidden_states = torch.zeros((self.batch_size * num_beams * self.max_input_length * self.encoder_hidden_size), dtype=self.data_type).cuda()
- self.bindings[1] = self.encoder_hidden_states.data_ptr()
- self.persist_cross_attention_kv_cache = False
-
- self.return_device = torch.device('cuda')
- self.variant = network_metadata.variant # record variant name to later index the vocab_size in forward()
-
- def set_encoder_hidden_states_for_inference_cycle(self, encoder_hidden_states):
-        """Caches the encoder hidden states so they are set once and reused across decoding steps of the same encoder session."""
-
- # Use in-place assignment so that the memory location of self.encoder_hidden_states will never change.
- # PyTorch will handle the FP32->FP16 conversion automatically if that is needed.
- self.encoder_hidden_states[:encoder_hidden_states.numel()] = encoder_hidden_states.flatten()
- self.persist_encoder_hidden_states = True
- self.trt_context.set_binding_shape(1, encoder_hidden_states.shape)
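The in-place assignment above is what keeps the buffer's device address stable, so the pointer registered with the TensorRT binding never has to be updated. A minimal sketch of that pattern, assuming a CUDA-capable PyTorch install and using made-up sizes:

```python
import torch

# Fixed-address buffer sketch: allocate once for the max profile, copy new data
# into the leading slice, and the pointer handed to the binding stays valid.
MAX_ELEMS = 4 * 128 * 768                              # hypothetical batch * seq * hidden bound
buffer = torch.zeros(MAX_ELEMS, dtype=torch.float16, device="cuda")
ptr_before = buffer.data_ptr()

new_states = torch.randn(4, 57, 768, device="cuda")   # a shorter sequence for this step
buffer[:new_states.numel()] = new_states.flatten()     # in-place copy; PyTorch casts FP32 -> FP16
assert buffer.data_ptr() == ptr_before                 # the registered pointer is unchanged
```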
-
- def set_cross_attention_kv_cache_engine(self, cross_attention_kv_generator):
- self.cross_attention_kv_generator = cross_attention_kv_generator
- with open(self.cross_attention_kv_generator.fpath, "rb") as f:
- trt_runtime = trt.Runtime(self.trt_logger)
- self.cross_attention_kv_generator_trt_engine = trt_runtime.deserialize_cuda_engine(f.read())
- self.cross_attention_kv_generator_trt_context = self.cross_attention_kv_generator_trt_engine.create_execution_context()
- self.cross_attention_bindings = [None] * self.cross_attention_kv_generator_trt_engine.num_bindings
- self.cross_attention_bindings[0] = self.encoder_hidden_states.data_ptr()
- # Cross attention cache as outputs
- for i in range(self.num_decoder_layers):
- self.cross_attention_bindings[2*i+1] = self.cross_attention_cache[f"past_key_values.{i}.encoder.key"].data_ptr()
- self.cross_attention_bindings[2*i+2] = self.cross_attention_cache[f"past_key_values.{i}.encoder.value"].data_ptr()
-
- def set_cross_attention_kv_cache_for_inference_cycle(self, encoder_hidden_states):
- """
- Used to cache encoder-decoder cross attention kv caches across same encoder sessions.
-
-        Unlike the self-attention cache, the cross attention cache is constant during the decoding process, so we only need to set its bindings once at the first decoding step and skip it in all later steps (guarded by the self.persist_cross_attention_kv_cache flag).
- """
- self.cross_attention_kv_generator_trt_context.set_binding_shape(0, encoder_hidden_states.shape)
- assert self.cross_attention_kv_generator_trt_context.all_binding_shapes_specified
- self.cross_attention_kv_generator_trt_context.execute_v2(bindings=self.cross_attention_bindings)
- self.persist_cross_attention_kv_cache = True
-
- def set_return_device(self, return_device):
- """
-        Sets the device that outputs are returned on via to(). Device names follow torch conventions: cuda, cpu, etc.
- This is used in our measurement code.
- """
- self.return_device = return_device
- self.device = return_device
-
- def _reorder_cache(self, past, beam_idx):
- # Reference: https://huggingface.co/transformers/v4.11.3/_modules/transformers/models/t5/modeling_t5.html
- # Note that for BART, this function is static, but for T5, it is not
- # if decoder past is not included in output
- # speedy decoding is disabled and no need to reorder
- if past is None:
- print("You might want to consider setting `use_cache=True` to speed up decoding")
- return past
-
- reordered_decoder_past = ()
- for layer_past_states in past:
- # get the correct batch idx from layer past batch dim
- # batch dim of `past` is at 2nd position
- reordered_layer_past_states = ()
- for layer_past_state in layer_past_states:
- if layer_past_state is not None:
- # need to set correct `past` for each of the four key / value states
- reordered_layer_past_states = reordered_layer_past_states + (
- layer_past_state.index_select(0, beam_idx.to(layer_past_state.device)),
- )
- else:
- reordered_layer_past_states = reordered_layer_past_states + (None,)
-
- assert reordered_layer_past_states[0].shape == layer_past_states[0].shape
- assert len(reordered_layer_past_states) == len(layer_past_states)
-
- reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)
- return reordered_decoder_past
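`_reorder_cache` carries the surviving beams' key/value rows forward by indexing the batch dimension. A toy sketch of the `index_select` step, with shapes invented for the example:

```python
import torch

# Beam reordering on a dummy cache tensor of shape (batch * num_beams, heads, seq, head_dim).
batch_size, num_beams, num_heads, seq_len, head_dim = 2, 3, 4, 5, 8
layer_past = torch.randn(batch_size * num_beams, num_heads, seq_len, head_dim)

# beam_idx[i] names the beam whose cache should be carried forward into slot i.
beam_idx = torch.tensor([0, 0, 2, 3, 5, 4])
reordered = layer_past.index_select(0, beam_idx)

assert reordered.shape == layer_past.shape
assert torch.equal(reordered[1], layer_past[0])   # slot 1 now holds beam 0's cache
```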
-
- def forward(self, input_ids, encoder_hidden_states, encoder_outputs=None, *args, **kwargs):
- # Get the batch size.
- bs = input_ids.shape[0] # in beam search mode, bs is batch_size * num_beams
-
- # Actual sequence length of the input_ids and the output hidden_states.
- input_length = input_ids.shape[1]
-
- # The sequence length of the encoder_hidden_states.
- encoder_length = TRTHFRunner.ENCODER_LENGTH
-
- is_cpu_mode = (input_ids.device == torch.device("cpu")) or (self.return_device == "cpu")
-
- if is_cpu_mode:
- input_ids = input_ids.int().cuda()
-
-        # input_ids needs to be of int type.
- self.bindings[0] = input_ids.int().data_ptr()
- self.trt_context.set_binding_shape(0, input_ids.shape)
-
- # If encoder hidden states have not been copied yet, copy the hidden states to the input buffer.
- if not self.persist_encoder_hidden_states:
- self.set_encoder_hidden_states_for_inference_cycle(encoder_hidden_states)
-
- if self.config.use_cache:
- if (kwargs.get("past_key_values") is None):
- self.past_decoder_length = 0
- if not self.persist_cross_attention_kv_cache:
- self.set_cross_attention_kv_cache_for_inference_cycle(encoder_hidden_states)
- cross_attention_kv_shape = (bs, self.num_heads, encoder_length, self.embedding_size_per_head)
- for i in range(self.num_decoder_layers):
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 2, cross_attention_kv_shape)
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 3, cross_attention_kv_shape)
-
- # When switching trt profiles, the binding shape needs to be reset, so we set binding shape at each forward pass
- self_attention_kv_shape = (bs, self.num_heads, self.past_decoder_length, self.embedding_size_per_head)
- for i in range(self.num_decoder_layers):
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i, self_attention_kv_shape)
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 1, self_attention_kv_shape)
-
- # Launch TRT inference.
- assert self.trt_context.all_binding_shapes_specified
- self.trt_context.execute_v2(bindings=self.bindings)
-
-        # For bs > 1, this slicing is required, so this D2D copy cannot be avoided.
- logits_length = bs * input_length * self.config.vocab_size
- logits = self.hidden_states.flatten()[:logits_length].view(bs, input_length, self.config.vocab_size)
- if is_cpu_mode:
- logits = logits.cpu()
-
- present_key_values = None
- if self.config.use_cache:
- present_key_values = ()
- num_heads = self.num_heads
- embedding_size_per_head = self.embedding_size_per_head
-
- for i in range(self.num_decoder_layers):
- self_attention_k_output = self.self_attention_cache[f"key_values.{i}.decoder.key"]
- self_attention_v_output = self.self_attention_cache[f"key_values.{i}.decoder.value"]
- if is_cpu_mode:
- self_attention_k_output = self_attention_k_output.cpu()
- self_attention_v_output = self_attention_v_output.cpu()
-
- present_key_values += ((self_attention_k_output, self_attention_v_output),)
-
- self.past_decoder_length += 1
-
- # Transfer predictions back from GPU to do greedy search
- return Seq2SeqLMOutput(logits=logits.to(self.return_device), past_key_values=present_key_values,)
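Because `self.hidden_states` is allocated for the maximum profile, only its leading elements are valid after a run, and the flatten/slice/view above recovers the `(bs, input_length, vocab_size)` logits. A small sketch of that recovery, with illustrative sizes rather than the real T5 dimensions:

```python
import torch

# The engine writes a (bs, input_length, vocab) result contiguously into the front of an
# over-allocated flat buffer; flatten/slice/view recovers it. Sizes below are made up.
vocab_size = 320
bs, input_length, max_elems = 2, 3, 4 * 16 * 320

out_buffer = torch.zeros(max_elems)                  # preallocated for the max profile
result = torch.randn(bs, input_length, vocab_size)
out_buffer[:result.numel()] = result.flatten()       # stands in for the TensorRT output write

logits = out_buffer[:bs * input_length * vocab_size].view(bs, input_length, vocab_size)
assert torch.equal(logits, result)
```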
-
- def prepare_inputs_for_generation(self, input_ids, past=None, use_cache=None, **kwargs):
- # In HuggingFace generation_utils.py, this function will be called at each decoding step, before running the decoder's forward().
-
- if past is not None:
- input_ids = input_ids[:, -1:]
-
- ret = {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_outputs"].get("last_hidden_state"),
- }
-
- if self.config.use_cache:
- ret["use_cache"] = use_cache
- ret["past_key_values"] = past
-
- return ret
-
- def reset(self):
- '''
-        Always call this function after each use, because T5TRTDecoder does not clear the cached encoder_hidden_states or cross_attention itself.
- '''
- self.persist_encoder_hidden_states = False
- self.encoder_hidden_states.zero_()
- if self.config.use_cache:
- self.persist_cross_attention_kv_cache = False
-
-class T5TRT(TRTInferenceCommand):
- def __init__(self):
- super().__init__(
- T5ModelTRTConfig,
- "Runs trt results for T5 model.",
- T5FHuggingFace,
- )
- self.t5_trt_decoder = None
- self.t5_trt_encoder = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_trt_engine: bool = False,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.t5_trt_encoder:
- self.t5_trt_encoder.release()
- if self.t5_trt_decoder:
- self.t5_trt_decoder.release()
-
- if not keep_trt_engine:
- self.t5_trt_encoder_engine.cleanup()
- self.t5_trt_decoder_engine.cleanup()
- # TODO: Avoid using workspace.metadata to handle additional removals.
- if workspace.metadata.other.kv_cache:
- self.t5_trt_cross_attention_kv_generator.cleanup()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def generate(
- self,
- input_ids,
- min_length: int = None,
- max_length: int = None,
- num_beams: int = 1,
- use_cache: bool = False,
- early_stopping: bool = True,
- ):
- batch_size = input_ids.shape[0]
- hf_config = self.t5_trt_decoder.config
-
- if max_length is None:
- max_length = T5ModelTRTConfig.MAX_OUTPUT_LENGTH[self.metadata.variant]
-
- if min_length is None:
- min_length = T5ModelTRTConfig.MIN_OUTPUT_LENGTH[self.metadata.variant]
-
- encoder_last_hidden_state = self.t5_trt_encoder(input_ids=input_ids).to("cuda")
-
- decoder_output = self.t5_trt_decoder.generate(
- input_ids,
- max_length = max_length,
- min_length = min_length,
- num_beams = num_beams,
- early_stopping = early_stopping,
- eos_token_id = self.t5_trt_decoder.config.eos_token_id,
- pad_token_id = self.t5_trt_decoder.config.pad_token_id,
- use_cache = use_cache,
- encoder_outputs = BaseModelOutput(last_hidden_state = encoder_last_hidden_state),
- )
-
- self.t5_trt_decoder.reset()
- return decoder_output
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: T5TRTBenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer = T5Tokenizer.from_pretrained(metadata.variant)
- hf_config = self.t5_trt_decoder.config
-        # Prepare the input tokens and determine the output sequence length.
- if not benchmarking_mode:
- output_seq_len = T5ModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- input_seq_len = benchmarking_args.input_seq_len
- output_seq_len = benchmarking_args.output_seq_len
-
- input_ids = torch.randint(0, hf_config.vocab_size, (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- self.t5_trt_encoder, input_ids, timing_profile
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
-
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- self.t5_trt_decoder,
- expand_inputs_for_beam_search(decoder_input_ids, num_beams) if num_beams > 1 else decoder_input_ids,
- expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state,
- timing_profile,
- use_cache=metadata.other.kv_cache,
- )
-
- self.t5_trt_decoder.reset()
-
- decoder_output, full_e2e_runtime = full_inference(
- self.t5_trt_encoder,
- self.t5_trt_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=T5ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- num_beams = num_beams,
- )
-
- # Prepare runtime results.
- runtime = [
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models=NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=[
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.t5_trt_decoder_engine.fpath,
- ),
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=self.t5_trt_encoder_engine.fpath,
- ),
- ],
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[0, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- encoder_input: str,
- decoder_input: str,
- batch_size: int,
- ):
- tokenizer = T5Tokenizer.from_pretrained(metadata.variant)
- encoder_input_ids = tokenizer([encoder_input] * batch_size, padding=True, return_tensors="pt").input_ids
- decoder_input_ids = tokenizer([decoder_input] * batch_size, padding=True, return_tensors="pt").input_ids
-
- perplexity = calculate_perplexity(
- self.t5_trt_encoder, self.t5_trt_decoder, tokenizer, encoder_input_ids, decoder_input_ids,
- T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
- )
- return perplexity
-
- def _setup_engines(
- self,
- metadata: NetworkMetadata,
- hash_onnx_fpath: Dict[str, NetworkModel],
- batch_size: int,
- num_beams: int,
- disable_preview_dynamic_shapes: bool,
- benchmarking_args: T5TRTBenchmarkingArgs = None,
- seq_tag: bool = False, # whether the benchmark engine tag format should be seq or max
- ) -> None:
-
-        # The number of exported ONNX networks must not exceed the number of network segments explicitly defined by the configuration file.
- assert len(hash_onnx_fpath) == len(
- T5ModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in T5 model.".format(
- len(T5ModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- decoder_onnx_fpath = hash_onnx_fpath[
- T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ].fpath
- encoder_onnx_fpath = hash_onnx_fpath[
- T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME
- ].fpath
-
-        # Use HuggingFace T5Config to set up parameters instead of hard-coded values.
- hf_config = T5Config.from_pretrained(
- metadata.variant,
- use_cache=metadata.other.kv_cache
- )
-
- # Generate optimization profiles.
- # non-benchmarking mode: opt profile length is by default half of the max profile
- # benchmarking mode: user can specify opt and max profile by flags. If no additional benchmarking flags are provided, it will just use the non-benchmarking mode defaults
- max_input_length = hf_config.d_model
- max_output_length = hf_config.d_model
- opt_input_seq_len = max_input_length // 2
- opt_output_seq_len = max_output_length // 2
-
- # benchmarking flags
- if benchmarking_args is not None:
- max_input_length = benchmarking_args.input_profile_max_len
- max_output_length = benchmarking_args.output_profile_max_len
- opt_input_seq_len = benchmarking_args.input_seq_len
- opt_output_seq_len = benchmarking_args.output_seq_len
-
- encoder_hidden_size = hf_config.d_model
-
- encoder_profiles = [
- Profile().add(
- "input_ids",
- min=(batch_size, 1),
- opt=(batch_size, opt_input_seq_len),
- max=(batch_size, max_input_length),
- )
- ]
-
- # Set up the non kv engine, used for non-kv mode and kv mode generation phase (1st decoder run uses the non-kv profile to generate kv cache)
- dec_profiles = Profile()
-
- # for beam search, decoder engine's inputs are expanded `num_beams` times
- # optimization profiles should be changed accordingly, but onnx models can be shared across greedy/beam because the first dim (batch size) is already a dynamic value, so no change needed in export.py
- if not hf_config.use_cache:
- dec_profiles = dec_profiles.add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, opt_output_seq_len),
- max=(batch_size * num_beams, max_output_length),
- )
- else:
- dec_profiles = dec_profiles.add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, 1),
- max=(batch_size * num_beams, 1),
- )
-
- dec_profiles = dec_profiles.add(
- "encoder_hidden_states",
- min=(batch_size * num_beams, 1, encoder_hidden_size),
- opt=(batch_size * num_beams, opt_input_seq_len, encoder_hidden_size),
- max=(batch_size * num_beams, max_input_length, encoder_hidden_size),
- )
-
- if hf_config.use_cache:
-
- num_heads = hf_config.num_heads
- embedding_size_per_head = hf_config.d_kv
- num_decoder_layers = hf_config.num_decoder_layers
-            # Use the TensorRT zero-tensor feature for the 1st decoder run; the self attention cache grows as the sequence length increases.
- self_attention_profile = {
- "min": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_output_seq_len - 1, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_output_length - 1, embedding_size_per_head),
- }
-
- # Cross attention kv cache does not change during single decoder iteration.
- cross_attention_profile = {
- "min": (batch_size * num_beams, num_heads, 1, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_input_seq_len, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_input_length, embedding_size_per_head),
- }
-
- for i in range(num_decoder_layers):
- dec_profiles = dec_profiles.add(
- f"past_key_values.{i}.decoder.key",
- **self_attention_profile
- ).add(
- f"past_key_values.{i}.decoder.value",
- **self_attention_profile
- ).add(
- f"past_key_values.{i}.encoder.key",
- **cross_attention_profile
- ).add(
- f"past_key_values.{i}.encoder.value",
- **cross_attention_profile
- )
-
- decoder_profiles = [dec_profiles]
-
- # Convert ONNX models to TRT engines.
- if benchmarking_args is None:
- engine_tag = "bs{}".format(batch_size)
-        # When the user does not provide any profile_max_len, use seq as the tag; both max values fall back to the config max.
- elif seq_tag:
- engine_tag = "bs{}-inseq{}-outseq{}".format(batch_size, benchmarking_args.input_seq_len, benchmarking_args.output_seq_len)
-        # When the user provides profile_max_len, the engine can be reused later with different seq_len values.
- else:
- engine_tag = "bs{}-inmax{}-outmax{}".format(batch_size, benchmarking_args.input_profile_max_len, benchmarking_args.output_profile_max_len)
-
- if num_beams > 1:
- engine_tag += "-beam{}".format(num_beams)
-
- preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
- if disable_preview_dynamic_shapes:
- engine_tag += "-noPreviewFasterDynamicShapes"
- else:
- preview_features.append(PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)
-
- self.t5_trt_encoder_engine = T5EncoderONNXFile(
- encoder_onnx_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(encoder_onnx_fpath)[0] + "-{}.engine".format(engine_tag).replace(f"-beam{num_beams}", ""), # encoder engine name not affected by beam search
- profiles=encoder_profiles,
- preview_features=preview_features
- )
-
- self.t5_trt_decoder_engine = T5DecoderONNXFile(
- decoder_onnx_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles,
- preview_features=preview_features
- )
-
- # Create T5TRTEncoder and T5TRTDecoder instances.
- self.t5_trt_encoder = T5TRTEncoder(
- self.t5_trt_encoder_engine, metadata, hf_config, batch_size=batch_size, benchmarking_args=benchmarking_args
- )
- self.t5_trt_decoder = T5TRTDecoder(
- self.t5_trt_decoder_engine, metadata, hf_config, batch_size=batch_size, num_beams=num_beams, benchmarking_args=benchmarking_args
- )
-
- if metadata.other.kv_cache:
- # Set up context phase profile. Context phase will use encoder_hidden_states to generate cross attention kv cache.
- cross_attention_kv_generation_profiles = [Profile().add(
- "encoder_hidden_states",
- min=(batch_size * num_beams, 1, encoder_hidden_size),
- opt=(batch_size * num_beams, opt_input_seq_len, encoder_hidden_size),
- max=(batch_size * num_beams, max_input_length, encoder_hidden_size),
- )]
- decoder_folder, decoder_name = os.path.split(decoder_onnx_fpath)
- decoder_name, decoder_ext = os.path.splitext(decoder_name)
- decoder_onnx_fpath_kv_generator = os.path.join(decoder_folder, "cross_attention_kv_generator", decoder_name + "-cross_attention_kv_generator" + decoder_ext)
- self.t5_trt_cross_attention_kv_generator = T5DecoderONNXFile(
- decoder_onnx_fpath_kv_generator, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_fpath_kv_generator)[0] + "-{}.engine".format(engine_tag),
- profiles=cross_attention_kv_generation_profiles,
- preview_features=preview_features
- )
-
- self.t5_trt_decoder.set_cross_attention_kv_cache_engine(self.t5_trt_cross_attention_kv_generator)
-
- def run_trt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_trt_engine: bool,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- disable_preview_dynamic_shapes: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult] :
-
- workspace = self._setup_workspace(metadata, working_directory)
-
- # Keep onnx and Torch models if they are provided by users.
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self._download_models(workspace, metadata)
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
- hash_onnx_fpath = {v.name: v for v in onnx_fpaths}
-
- inference_results = []
- ppl_results = []
- try:
- if not benchmarking_mode:
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes)
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, hash_onnx_fpath, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- self.t5_trt_decoder.reset()
-
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Encoder and decoder inputs must pair up"
- if metadata.other.kv_cache or (args.num_beams > 1):
- G_LOGGER.warning("Skipping perplexity calculation for TRT with KV cache or beam search because it is not supported yet.")
- else:
- for ei, di in zip(network_input, perplexity_reference):
- ppl_results.append(
- self.execute_calculate_perplexity(metadata, ei, di, batch_size)
- )
- self.t5_trt_decoder.reset()
-
- else:
-                # Check that input_seq_len and output_seq_len are valid and within the required range.
- max_input_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- max_output_seq_len = T5ModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
-
- seq_tag = args.input_profile_max_len is None and args.output_profile_max_len is None
-                # The user must provide either a pair of profile_max_len values or a pair of seq_len values for input/output.
- if args.input_profile_max_len is None or args.output_profile_max_len is None:
- if args.input_seq_len is None or args.output_seq_len is None:
- assert False, "Please provide at least one pair of inputs: [input/output]_seq_len or [input/output]_profile_max_len"
-
- input_profile_max_len = setup_benchmark_arg(args.input_profile_max_len, "input_profile_max_len", max_input_seq_len)
- output_profile_max_len = setup_benchmark_arg(args.output_profile_max_len, "output_profile_max_len", max_output_seq_len)
- input_seq_len = setup_benchmark_arg(args.input_seq_len, "input_seq_len", input_profile_max_len // 2)
- output_seq_len = setup_benchmark_arg(args.output_seq_len, "output_seq_len", output_profile_max_len // 2)
-
- benchmarking_args = T5TRTBenchmarkingArgs(input_seq_len, output_seq_len, input_profile_max_len, output_profile_max_len)
-
- # Assert to ensure the validity of benchmarking arguments
- assert benchmarking_args.input_seq_len <= benchmarking_args.input_profile_max_len, "input_seq_len should <= input_profile_max_len = {} for benchmarking mode".format(benchmarking_args.input_profile_max_len)
- assert benchmarking_args.output_seq_len <= benchmarking_args.output_profile_max_len, "output_seq_len should <= output_profile_max_len = {} for benchmarking mode".format(benchmarking_args.output_profile_max_len)
- assert benchmarking_args.input_profile_max_len <= max_input_seq_len, "Model config restrict input_profile_max_len <= {} for benchmark mode".format(max_input_seq_len)
- assert benchmarking_args.output_profile_max_len <= max_output_seq_len, "Model config restrict output_profile_max_len <= {} for benchmark mode".format(max_output_seq_len)
-
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes, benchmarking_args, seq_tag)
- inference_results = self.execute_inference(
- metadata, hash_onnx_fpath, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_trt_engine, keep_onnx_model, keep_torch_model)
-
- return inference_results, ppl_results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
- polygraphy_group = parser.add_argument_group("polygraphy models")
- polygraphy_group.add_argument(
- "--onnx-decoder-fpath",
- default=None,
- help="Path to ONNX decoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
- polygraphy_group.add_argument(
- "--onnx-encoder-fpath",
- default=None,
- help="Path to ONNX encoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
- # Check if both flags are given otherwise error out
- decoder_fpath_check = args.onnx_decoder_fpath is None
- encoder_fpath_check = args.onnx_encoder_fpath is None
-
- network_models = None
- if decoder_fpath_check and encoder_fpath_check:
- network_models = tuple()
- elif decoder_fpath_check or encoder_fpath_check:
- raise self._parser.error(
- "Both --onnx-decoder-fpath and --onnx-encoder-fpath must be given. Otherwise neither should be provided for script to download them."
- )
- else:
- onnx_decoder = NetworkModel(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_decoder_fpath,
- )
- onnx_encoder = NetworkModel(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=args.onnx_encoder_fpath,
- )
- network_models = (onnx_decoder, onnx_encoder)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = T5TRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/notebooks/.gitignore b/demo/HuggingFace/notebooks/.gitignore
deleted file mode 100644
index 899448b7..00000000
--- a/demo/HuggingFace/notebooks/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-**/.ipynb_checkpoints
-models/
diff --git a/demo/HuggingFace/notebooks/README.md b/demo/HuggingFace/notebooks/README.md
deleted file mode 100644
index a08cdd15..00000000
--- a/demo/HuggingFace/notebooks/README.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# TensorRT Demo with HuggingFace Models
-
-To run the demo Jupyter notebooks in this folder, follow the instructions in the [TRT setup guide](../../../README.md) to build and launch the docker container, e.g. `./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda11.7` and `./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda11.7 --gpus all --jupyter <port>`, specifying the Jupyter port number.
-
-Then, use your browser to start the Jupyter lab interface by opening the token-protected link provided in the terminal, e.g. `http://<host>:<port>/lab?token=...`.
-
-Notebook list:
-
-- [gpt2.ipynb](gpt2.ipynb): Step by step walkthrough for building the GPT-2 TensorRT engine.
-- [gpt2-playground.ipynb](gpt2-playground.ipynb): GUI for benchmarking GPT-2 TensorRT engines.
-- [t5.ipynb](t5.ipynb): Step by step walkthrough for building the T5 TensorRT engine.
-- [t5-playground.ipynb](t5-playground.ipynb): GUI for benchmarking T5 TensorRT engines.
-- [bart.ipynb](bart.ipynb): Step by step walkthrough for building the BART TensorRT engine.
-- [bart-playground.ipynb](bart-playground.ipynb): GUI for benchmarking BART TensorRT engines.
diff --git a/demo/HuggingFace/notebooks/bart-playground.ipynb b/demo/HuggingFace/notebooks/bart-playground.ipynb
deleted file mode 100644
index 59e0e20f..00000000
--- a/demo/HuggingFace/notebooks/bart-playground.ipynb
+++ /dev/null
@@ -1,317 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "64974d33-d028-440c-86fa-1a0633b3d31d",
- "metadata": {},
- "outputs": [],
- "source": [
- "# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
- "# SPDX-License-Identifier: Apache-2.0\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c3f0ff46-9958-4d57-9067-a64be34e75da",
- "metadata": {},
- "source": [
- "##### \n",
- "\n",
- "# BART Playground\n",
- "\n",
-    "This notebook demonstrates the BART model on the tasks of text summarization and mask filling.\n",
-    "\n",
-    "The TensorRT HuggingFace BART model is a plug-in replacement for the original PyTorch modules in the HuggingFace BART model.\n",
- "\n",
- "**Notes**: \n",
- " - For \"CPU - PyTorch\" and \"GPU - PyTorch\", a BART-base model from HuggingFace model repository is employed. Inference is carried out in FP32 for CPU-PyTorch, and FP16 for GPU-PyTorch and TensorRT. All models run with batch size 1.\n",
- "Average run time across 5 runs is reported.\n",
- " - Prior to running this notebook, run [bart.ipynb](bart.ipynb) to download the BART model and generate the TensorRT engine."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a005d22d-5b54-4e0c-866e-6eee6a6f98e4",
- "metadata": {},
- "outputs": [],
- "source": [
- "import ipywidgets as widgets\n",
- "\n",
- "model_selection = widgets.RadioButtons(\n",
- " options=['facebook/bart-base', \n",
- " 'facebook/bart-large', \n",
- " 'facebook/bart-large-cnn', \n",
- " 'facebook/mbart-large-50'],\n",
- " description='Model:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "display(model_selection)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d35a33fd-4e85-4a1e-9989-af5adf903f79",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "import glob\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import torch \n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " AutoModelForPreTraining,\n",
- " AutoTokenizer,\n",
- " MBartForConditionalGeneration, \n",
- " MBart50Tokenizer,\n",
- " AutoConfig,\n",
- ")\n",
- "\n",
-    "# download HuggingFace model and tokenizer\n",
- "BART_VARIANT = model_selection.value\n",
- "\n",
- "# mbart variant can't be recognized by HF AutoClass yet\n",
- "if \"mbart\" not in BART_VARIANT: \n",
- " bart_model = AutoModelForPreTraining.from_pretrained(BART_VARIANT) # BartForConditionalGeneration\n",
- " tokenizer = AutoTokenizer.from_pretrained(BART_VARIANT) # BartTokenizer\n",
- "else:\n",
- " bart_model = MBartForConditionalGeneration.from_pretrained(BART_VARIANT)\n",
- " tokenizer = MBart50Tokenizer.from_pretrained(BART_VARIANT, src_lang=\"en_XX\")\n",
- "\n",
- "config = AutoConfig.from_pretrained(BART_VARIANT)\n",
- "\n",
- "# load TensorRT engine\n",
- "from BART.trt import BARTTRTEncoder, BARTTRTDecoder, TRTHFRunner\n",
- "from BART.BARTModelConfig import BARTModelTRTConfig, BARTMetadata\n",
- "from BART.export import BARTDecoderTRTEngine, BARTEncoderTRTEngine\n",
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "\n",
- "from transformers.generation_logits_process import (\n",
- " NoRepeatNGramLogitsProcessor,\n",
- " MinLengthLogitsProcessor,\n",
- " ForcedBOSTokenLogitsProcessor,\n",
- " ForcedEOSTokenLogitsProcessor,\n",
- " LogitsProcessorList,\n",
- ")\n",
- "from transformers.generation_stopping_criteria import (\n",
- " MaxLengthCriteria,\n",
- " StoppingCriteriaList,\n",
- ")\n",
- "\n",
- "trt_config = AutoConfig.from_pretrained(BART_VARIANT)\n",
- "trt_config.use_cache = False\n",
- "trt_config.num_layers = BARTModelTRTConfig.NUMBER_OF_LAYERS[BART_VARIANT]\n",
- "\n",
- "metadata=NetworkMetadata(variant=BART_VARIANT, precision=Precision(fp16=True), other=BARTMetadata(kv_cache=False))\n",
- "metadata_string = BARTModelTRTConfig().get_metadata_string(metadata)\n",
- "\n",
- "encoder_stem = metadata_string + \"-encoder.onnx\"\n",
- "decoder_stem = metadata_string + \"-decoder-with-lm-head.onnx\"\n",
- "\n",
- "encoder_path = glob.glob(f'./models/{BART_VARIANT}/tensorrt/{encoder_stem}*')[0]\n",
- "decoder_path = glob.glob(f'./models/{BART_VARIANT}/tensorrt/{decoder_stem}*')[0]\n",
- "\n",
- "if not os.path.exists(encoder_path) or not os.path.exists(decoder_path):\n",
- " print(f\"Error: TensorRT engine not found at ./models/{BART_VARIANT}/tensorrt/. Please run bart.ipynb to generate the TensorRT engines first!\")\n",
- "else:\n",
- " encoder_engine = BARTEncoderTRTEngine(encoder_path, metadata)\n",
- " decoder_engine = BARTDecoderTRTEngine(decoder_path, metadata)\n",
- "\n",
- "bart_trt_encoder = BARTTRTEncoder(encoder_engine, metadata, trt_config)\n",
- "bart_trt_decoder = BARTTRTDecoder(decoder_engine, metadata, trt_config)\n",
- "\n",
- "decoder_input_ids = torch.full(\n",
- " (1, 1), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32\n",
- ").to(\"cuda:0\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "766b8c94-ba8e-47c8-8624-57da462a0496",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import time\n",
- "\n",
- "device = widgets.RadioButtons(\n",
- " options=['CPU - PyTorch', \n",
- " 'GPU - PyTorch', \n",
- " 'GPU - TensorRT'],\n",
- " description='Device:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "task = widgets.RadioButtons(\n",
- " options=['Summarization', \n",
- " 'Mask Filling', \n",
- " ],\n",
- " description='Task:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "example_text = {\n",
- " task.options[0]:\n",
- " \"NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms.\",\n",
- " task.options[1]: \n",
-    "    \"My friends are <mask> but they eat too many carbs.\"\n",
- " }\n",
- " \n",
- "paragraph_text = widgets.Textarea(\n",
- " value=example_text[task.options[0]],\n",
- " placeholder='Type something',\n",
- " description='Context:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5, \n",
- ")\n",
- "\n",
- "generated_text = widgets.Textarea(\n",
- " value='...',\n",
- " placeholder='Context',\n",
- " description='BART output:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5,\n",
- ")\n",
- "button = widgets.Button(description=\"Generate\")\n",
- "\n",
- "display(paragraph_text)\n",
- "display(generated_text)\n",
- "display(device)\n",
- "display(task)\n",
- "\n",
- "from IPython.display import display\n",
- "box_layout = widgets.Layout(display='flex',\n",
- " flex_flow='column',\n",
- " align_items='center',\n",
- " width='100%')\n",
- "N_RUN = 6\n",
- "progress_bar = widgets.IntProgress(\n",
- " value=0,\n",
- " min=0,\n",
- " max=N_RUN,\n",
- " description='Progress:',\n",
- " bar_style='', # 'success', 'info', 'warning', 'danger' or ''\n",
- " style={'bar_color': 'green'},\n",
- " orientation='horizontal', \n",
- " layout=widgets.Layout(width='100%', height='50px')\n",
- ")\n",
- "\n",
- "box = widgets.HBox(children=[button],layout=box_layout)\n",
- "output = widgets.Output()\n",
- "display(box)\n",
- "display(progress_bar)\n",
- "display(output)\n",
- "\n",
- "max_output_length = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[BART_VARIANT]\n",
- "\n",
- "stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_output_length)])\n",
- "no_repeat_ngram_size = BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE\n",
- "min_length = BARTModelTRTConfig.MIN_OUTPUT_LENGTH[BART_VARIANT]\n",
- "logits_processor = LogitsProcessorList([\n",
- " NoRepeatNGramLogitsProcessor(no_repeat_ngram_size), \n",
- " MinLengthLogitsProcessor(min_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token)),\n",
- " ForcedBOSTokenLogitsProcessor(tokenizer.convert_tokens_to_ids(tokenizer.bos_token)),\n",
- " ForcedEOSTokenLogitsProcessor(max_output_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token))\n",
- "])\n",
- "\n",
- "def generate(b):\n",
- " progress_bar.value = 0\n",
- " inference_time_arr = []\n",
- " inputs = tokenizer(paragraph_text.value, return_tensors=\"pt\")\n",
- " \n",
- " with output:\n",
- " if device.value == 'GPU - TensorRT':\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " encoder_last_hidden_state = bart_trt_encoder(input_ids=inputs.input_ids)\n",
- " outputs = bart_trt_decoder.greedy_search(\n",
- " input_ids=decoder_input_ids,\n",
- " encoder_hidden_states=encoder_last_hidden_state,\n",
- " stopping_criteria = stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " )\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- " print(\"GPU - TensorRT - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- " elif device.value == 'CPU - PyTorch':\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " outputs = bart_model.float().to('cpu').generate(inputs.input_ids.to('cpu'), num_beams=1, max_length=max_output_length)\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- " print(\"CPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:])))\n",
- " \n",
- " elif device.value == 'GPU - PyTorch': \n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " outputs = bart_model.half().to('cuda:0').generate(inputs.input_ids.to('cuda:0'), num_beams=1, max_length=max_output_length)\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- " print(\"GPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- "\n",
- "\n",
- "def switch_task(change):\n",
- " with output:\n",
- " paragraph_text.value = example_text[task.value]\n",
- "\n",
- "task.observe(switch_task, 'value')\n",
- "\n",
- "button.on_click(generate)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/bart.ipynb b/demo/HuggingFace/notebooks/bart.ipynb
deleted file mode 100644
index 5a9dd70b..00000000
--- a/demo/HuggingFace/notebooks/bart.ipynb
+++ /dev/null
@@ -1,1206 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "28e6e614-e360-4292-965e-0d255027e9b9",
- "metadata": {},
- "outputs": [],
- "source": [
- "# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
- "# SPDX-License-Identifier: Apache-2.0\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9b88dc1a-a92d-44cc-9fb7-d9e2ef20c8e2",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# Accelerating HuggingFace BART Inference with TensorRT\n",
- "\n",
-    "BART is an encoder-decoder (sequence-to-sequence) model pretrained as a denoising autoencoder: input text is corrupted during pretraining and the model learns to reconstruct it. Once fine-tuned, it handles a wide variety of NLP tasks such as translation, classification, Q&A and summarization.\n",
- "\n",
- "This notebook shows easy steps to convert a [HuggingFace PyTorch BART model](https://huggingface.co/docs/transformers/model_doc/bart) to a TensorRT engine for high-performance inference, with performance comparison between PyTorch and TensorRT inference.\n",
- "\n",
- "1. [Download HuggingFace BART model](#1)\n",
- "1. [PyTorch HuggingFace Inference](#2)\n",
- "1. [TensorRT Engine Building](#3)\n",
- "1. [TensorRT Inference](#4)\n",
- "\n",
- "\n",
- "## Prerequisites\n",
- "\n",
- "Follow the instructions at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS docker container required to run this notebook.\n",
- "\n",
- "Next, we install some extra dependencies."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0c36ecb7-c622-4d95-a851-b9a6eb18e81b",
- "metadata": {},
- "outputs": [],
- "source": [
- "#%%capture\n",
- "!pip3 install -r ../requirements.txt\n",
- "!pip3 install ipywidgets"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a1bbdafb",
- "metadata": {},
- "source": [
- "**Note:** After this step, you should restart the Jupyter kernel for the change to take effect."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "235d2f1b-439e-4cd0-8286-1d63a13f2cf3",
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "# disable warning in notebook\n",
- "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
- "\n",
- "# notebook widgets\n",
- "import ipywidgets as widgets\n",
- "widget_style = {'description_width': 'initial'}\n",
- "widget_layout = widgets.Layout(width='auto')\n",
- "\n",
- "import torch\n",
- "import tensorrt as trt\n",
- "from tensorrt import PreviewFeature\n",
- "from polygraphy.backend.trt import Profile\n",
- "\n",
- "import numpy as np\n",
- "import time\n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " AutoModelForPreTraining,\n",
- " AutoTokenizer,\n",
- " AutoConfig,\n",
- ")\n",
- "\n",
- "# BART\n",
- "from BART.BARTModelConfig import BARTModelTRTConfig, BARTMetadata\n",
- "from BART.measurements import encoder_inference, decoder_inference, full_inference_greedy, full_inference_beam\n",
- "from BART.export import BARTEncoderTorchFile, BARTDecoderTorchFile, BARTEncoderONNXFile, BARTDecoderONNXFile, BARTEncoderTRTEngine, BARTDecoderTRTEngine\n",
- "from BART.trt import BARTTRTEncoder, BARTTRTDecoder\n",
- "\n",
- "# NNDF\n",
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "from NNDF.networks import TimingProfile\n",
- "from NNDF.general_utils import measure_python_inference_code\n",
- "from NNDF.torch_utils import expand_inputs_for_beam_search"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "af4254e2-11fd-4bc7-ac0b-60b1a9e07c4e",
- "metadata": {
- "tags": []
- },
- "source": [
- "\n",
- "\n",
- "## 1. Download HuggingFace BART model\n",
- "\n",
-    "First, we download the original HuggingFace PyTorch BART model from the HuggingFace model hub, together with its associated tokenizer.\n",
- "\n",
-    "The BART variants that are supported by TensorRT are: facebook/bart-base (139M), facebook/bart-large (406M), facebook/bart-large-cnn (406M), facebook/mbart-large-50 (680M)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6a14eabc-d863-454d-9078-849acc857bb0",
- "metadata": {
- "tags": []
- },
- "source": [
- "### Model and Inference Configuration"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "774c89f3-7dbb-423d-88b2-1de693324389",
- "metadata": {},
- "outputs": [],
- "source": [
- "# UI\n",
- "model_widget = widgets.Select(\n",
- " options=['facebook/bart-base', 'facebook/bart-large', 'facebook/bart-large-cnn', 'facebook/mbart-large-50'],\n",
- " value='facebook/bart-base',\n",
- " description='Model variant:',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "display(model_widget)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ed04130e-7f20-4a3e-bf76-52aa335f402d",
- "metadata": {},
- "outputs": [],
- "source": [
- "BART_VARIANT = model_widget.value\n",
- "\n",
- "disable_preview_dynamic_feature_widget = widgets.Checkbox(\n",
- " value=False,\n",
- " description='Disable 8.6 EA faster dynamic shapes feature',\n",
- " disabled=False,\n",
- " indent=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "FP16_widget = widgets.Checkbox(\n",
- " value=False,\n",
- " description='FP16',\n",
- " disabled=False,\n",
- " indent=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "HF_KV_widget = widgets.Checkbox(\n",
- " value=True,\n",
- " description='HuggingFace KV cache',\n",
- " disabled=False,\n",
- " indent=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "TRT_KV_widget = widgets.Checkbox(\n",
- " value=False,\n",
- " description='TensorRT KV cache (disabled due to performance improvements in progress, not beating non-KV version yet)', # \n",
- " disabled=True,\n",
- " indent=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "KV_widgets = widgets.HBox([HF_KV_widget,TRT_KV_widget])\n",
- "\n",
- "batch_size_widget = widgets.BoundedIntText(\n",
- " value=1,\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Batch size',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "max_input_len_widget = widgets.BoundedIntText(\n",
- " value=BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[BART_VARIANT],\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Max input length',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "min_output_len_widget = widgets.BoundedIntText(\n",
- " value=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[BART_VARIANT],\n",
- " min=0,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Min output length',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "max_output_len_widget = widgets.BoundedIntText(\n",
- " value=BARTModelTRTConfig.MAX_OUTPUT_LENGTH[BART_VARIANT],\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Max output length',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "encoder_hidden_size_widget = widgets.BoundedIntText(\n",
- " value=BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[BART_VARIANT],\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Encoder hidden size',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "num_beam_widget = widgets.BoundedIntText(\n",
- " value=1,\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Number of beams',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "widgets_all = widgets.VBox([\n",
- " FP16_widget, \n",
- " disable_preview_dynamic_feature_widget,\n",
- " KV_widgets,\n",
- " batch_size_widget, \n",
- " max_input_len_widget,\n",
- " min_output_len_widget,\n",
- " max_output_len_widget, \n",
- " encoder_hidden_size_widget,\n",
- " num_beam_widget\n",
- "])\n",
- "\n",
- "display(widgets_all)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "077dd494-e8d8-42f9-bdbd-0362f1213118",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Inference config\n",
- "FP16 = FP16_widget.value # flag to use FP16 precision in PyTorch & TRT\n",
- "disable_preview_dynamic_shapes = disable_preview_dynamic_feature_widget.value # flag to disable 8.5 EA feature\n",
- "HF_KV = HF_KV_widget.value # flag to use KV cache in HF\n",
- "TRT_KV = TRT_KV_widget.value # flag to use KV cache in TRT\n",
- "\n",
- "# Model config\n",
- "batch_size = batch_size_widget.value\n",
- "max_input_len = max_input_len_widget.value\n",
- "min_output_len = min_output_len_widget.value\n",
- "max_output_len = max_output_len_widget.value\n",
- "encoder_hidden_size = encoder_hidden_size_widget.value\n",
- "num_beams = num_beam_widget.value\n",
- "\n",
- "# Benchmark config\n",
-    "# `TimingProfile` is a named tuple that specifies the number of experiments, the number of times to call the function per iteration, the number of warm-up calls, percentiles, etc.\n",
- "timing_profile = TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=[50,99])\n",
- "\n",
- "def percentile_print(timing):\n",
- " return ', '.join(['p{} {:.2f}ms'.format(timing_profile.percentile[i], p*1000) for i,p in enumerate(timing)])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fae66d58-f994-4987-8f1d-1fa8ac2ec8b4",
- "metadata": {},
- "outputs": [],
- "source": [
- "# mbart variant can't be recognized by HF AutoClass yet\n",
- "if \"mbart\" not in BART_VARIANT: \n",
- " bart_model = AutoModelForPreTraining.from_pretrained(BART_VARIANT) # BartForConditionalGeneration\n",
- " tokenizer = AutoTokenizer.from_pretrained(BART_VARIANT) # BartTokenizer\n",
- "else:\n",
- " from transformers import MBartForConditionalGeneration, MBart50Tokenizer\n",
- " bart_model = MBartForConditionalGeneration.from_pretrained(BART_VARIANT)\n",
- " tokenizer = MBart50Tokenizer.from_pretrained(BART_VARIANT, src_lang=\"en_XX\")\n",
- "\n",
- "config = AutoConfig.from_pretrained(BART_VARIANT)\n",
- "\n",
- "bart_model = bart_model.to('cuda').eval()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7252ca90-1104-40dc-8e72-f51c07a4cd11",
- "metadata": {},
- "outputs": [],
- "source": [
- "# save model locally\n",
- "pytorch_model_dir = './models/{}/pytorch'.format(BART_VARIANT)\n",
- "!mkdir -p $pytorch_model_dir\n",
- "\n",
- "if os.path.exists(pytorch_model_dir) and len(os.listdir(pytorch_model_dir)) != 0:\n",
- " print('PyTorch model already exists. Skipping...')\n",
- "else:\n",
- " bart_model.save_pretrained(pytorch_model_dir)\n",
- " print(\"PyTorch model saved to {}\".format(pytorch_model_dir))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8e4d1d6e-1cad-43a2-a8c3-4bc221070dc2",
- "metadata": {
- "tags": []
- },
- "source": [
- "### Test Input Data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fd1d0d09-be28-42a3-9135-46b796e5be79",
- "metadata": {},
- "outputs": [],
- "source": [
- "# input sequence\n",
- "inputs = \"NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorRT, built on the NVIDIA CUDA parallel programming model, enables developers to optimize inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also uses sparse tensor cores for an additional performance boost.\"\n",
- "\n",
- "input_ids = tokenizer(inputs, padding=True, return_tensors=\"pt\").input_ids.to('cuda')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "11ea023d-c4d4-43bb-9d77-c76684e0b06f",
- "metadata": {
- "tags": []
- },
- "source": [
- "\n",
- "\n",
- "## 2. PyTorch HuggingFace Inference\n",
- "\n",
- "Next, we will carry out inference with the HuggingFace PyTorch model as a baseline."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fdb1d921-db47-4c45-bdcc-08ccc500ad99",
- "metadata": {},
- "source": [
- "### End-to-End HuggingFace Inference"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "10168132",
- "metadata": {},
- "outputs": [],
- "source": [
- "# WAR: Using an ugly representation because cuda 11.4 does not support GPU models due to cublas errors\n",
- "cuda_114_mode = \"cuda-11.4\" in os.environ[\"LD_LIBRARY_PATH\"]\n",
- "if cuda_114_mode:\n",
- " bart_model = bart_model.cpu()\n",
- " input_ids = input_ids.cpu()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d886e29a-1d1d-49e0-a351-3e4418f4bf28",
- "metadata": {},
- "outputs": [],
- "source": [
- "# encoder-decoder inference \n",
- "with torch.no_grad():\n",
- " output_ids = bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=False) \n",
- " outputs = tokenizer.decode(output_ids[-1,:], skip_special_tokens=True) \n",
- "outputs_hf = outputs\n",
- "\n",
- "# timing\n",
- "# FP32\n",
- "bart_model.float()\n",
- "hf_nonkv_time = measure_python_inference_code(lambda: bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=False), timing_profile)\n",
- "hf_kv_time = measure_python_inference_code(lambda: bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- "\n",
-    "# FP16: cuda 11.4 has a cublas error that will fail for BART in both CPU and GPU modes\n",
- "if not cuda_114_mode:\n",
- " bart_model.half()\n",
- "hf_nonkv_time_fp16 = measure_python_inference_code(lambda: bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=False), timing_profile)\n",
- "hf_kv_time_fp16 = measure_python_inference_code(lambda: bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=True), timing_profile)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "dab5c682-049a-48b3-830c-e1eecccbd553",
- "metadata": {},
- "outputs": [],
- "source": [
- "# print results and timing statistics\n",
- "print(f'Input length: {input_ids.size(1)}')\n",
- "print(inputs)\n",
- "print('\\n') \n",
- "print(f'Output length: {output_ids[-1,:].size(0)}')\n",
- "print(outputs_hf)\n",
- "print('\\n') \n",
- "print(f'Device: {torch.cuda.get_device_name()}')\n",
- "print(f\"Precision: FP32, Number of Beams: {num_beams}\")\n",
- "print(f\"HF time (no KV cache): {percentile_print(hf_nonkv_time)}\")\n",
- "print(f\"HF time (w/ KV cache): {percentile_print(hf_kv_time)}\")\n",
- "print(f\"Precision: FP16, Number of Beams: {num_beams}\")\n",
- "print(f\"HF time (no KV cache): {percentile_print(hf_nonkv_time_fp16)}\")\n",
- "print(f\"HF time (w/ KV cache): {percentile_print(hf_kv_time_fp16)}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "667fcacc-02cb-415d-a9ff-2d2ec44ef225",
- "metadata": {
- "tags": []
- },
- "source": [
- "### Time Measurement of Encoder, Decoder, and Full E2E\n",
- "For benchmarking purposes, we will employ helper functions `encoder_inference`, `decoder_inference`, and `full_inference_greedy` which execute the inference repeatedly for the BART encoder and decoder stacks separately as well as end-to-end for the entire output sequence, and measure the execution time. These execution times can be later on compared with TensorRT counterpart to demonstrate the speedup. \n",
- "\n",
- "Encoder and decoder of BART are wrapped as standalone PyTorch module for testing."
- ]
- },
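- {
- "cell_type": "markdown",
- "id": "timing-helper-sketch-md",
- "metadata": {},
- "source": [
- "As a rough illustration of what these timing helpers do (the real implementations live in the demo's measurement utilities, e.g. `measure_python_inference_code`), a minimal percentile-based timing loop might look like the sketch below. The function and argument names here are made up for illustration only."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "timing-helper-sketch-code",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Illustrative sketch only -- not the demo's actual measure_python_inference_code implementation.\n",
- "import time\n",
- "import numpy as np\n",
- "import torch\n",
- "\n",
- "def measure_inference_sketch(func, iterations=10, warmup=3, percentiles=(50, 99)):\n",
- "    # warm-up runs are executed but excluded from the statistics\n",
- "    for _ in range(warmup):\n",
- "        func()\n",
- "    latencies = []\n",
- "    for _ in range(iterations):\n",
- "        start = time.perf_counter()\n",
- "        func()\n",
- "        if torch.cuda.is_available():\n",
- "            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock\n",
- "        latencies.append(time.perf_counter() - start)\n",
- "    # report latency percentiles in seconds, e.g. p50 and p99\n",
- "    return [np.percentile(latencies, p) for p in percentiles]"
- ]
- },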
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "2c07516f-b02b-4722-b0bd-06b632259702",
- "metadata": {},
- "outputs": [],
- "source": [
- "# FP32\n",
- "bart_model.float()\n",
- "bart_torch_encoder = BARTEncoderTorchFile.TorchModule(bart_model.get_encoder())\n",
- "bart_torch_decoder = BARTDecoderTorchFile.TorchModule(bart_model.get_decoder(), bart_model.lm_head, bart_model.final_logits_bias, bart_model.config)\n",
- "\n",
- "with torch.no_grad():\n",
- "\n",
- " encoder_last_hidden_state, encoder_pytorch_time = encoder_inference(bart_torch_encoder, input_ids, timing_profile)\n",
- " _, decoder_pytorch_time = decoder_inference(bart_torch_decoder, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state, timing_profile, use_cache=HF_KV)\n",
- " if num_beams == 1:\n",
- " output_ids, full_pytorch_time = full_inference_greedy(bart_torch_encoder,bart_torch_decoder,input_ids,tokenizer,timing_profile,max_length=max_output_len, min_length=min_output_len, use_cache=HF_KV)\n",
- " else:\n",
- " output_ids, full_pytorch_time = full_inference_beam(bart_torch_encoder,bart_torch_decoder,input_ids,tokenizer,timing_profile,num_beams=num_beams,max_length=max_output_len, min_length=min_output_len, use_cache=HF_KV)\n",
- " outputs = tokenizer.decode(output_ids[0], skip_special_tokens=True) \n",
- "\n",
- "outputs_pytorch = outputs\n",
- "\n",
- "# FP16\n",
- "if not cuda_114_mode:\n",
- " bart_model.half()\n",
- "else:\n",
- " print(\"CUDA 11.4 is incompatible with current PyTorch version, using fp32 instead of fp16\")\n",
- "bart_torch_encoder_fp16 = BARTEncoderTorchFile.TorchModule(bart_model.get_encoder())\n",
- "bart_torch_decoder_fp16 = BARTDecoderTorchFile.TorchModule(bart_model.get_decoder(), bart_model.lm_head, bart_model.final_logits_bias, bart_model.config)\n",
- "\n",
- "with torch.no_grad():\n",
- "\n",
- " encoder_last_hidden_state, encoder_pytorch_time_fp16 = encoder_inference(bart_torch_encoder_fp16, input_ids, timing_profile)\n",
- " _, decoder_pytorch_time_fp16 = decoder_inference(bart_torch_decoder_fp16, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state, timing_profile, use_cache=HF_KV)\n",
- " if num_beams == 1:\n",
- " output_ids_fp16, full_pytorch_time_fp16 = full_inference_greedy(bart_torch_encoder_fp16,bart_torch_decoder_fp16,input_ids,tokenizer,timing_profile,max_length=max_output_len, min_length=min_output_len, use_cache=HF_KV)\n",
- " else:\n",
- " output_ids_fp16, full_pytorch_time_fp16 = full_inference_beam(bart_torch_encoder_fp16,bart_torch_decoder_fp16,input_ids,tokenizer,timing_profile,num_beams=num_beams,max_length=max_output_len, min_length=min_output_len, use_cache=HF_KV)\n",
- " outputs_fp16 = tokenizer.decode(output_ids_fp16[0], skip_special_tokens=True) \n",
- "\n",
- "outputs_pytorch_fp16 = outputs_fp16"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a103e3a6-920b-4c97-818e-6140654abc5e",
- "metadata": {},
- "outputs": [],
- "source": [
- "# print\n",
- "print(f'PyTorch FP32 Output identical to HF results? {outputs_pytorch == outputs_hf}')\n",
- "print(f'PyTorch FP16 Output identical to HF results? {outputs_pytorch_fp16 == outputs_hf}')\n",
- "print('\\n') \n",
- "print(f'Device: {torch.cuda.get_device_name()}')\n",
- "print(f\"Precision: FP32, Number of Beams: {num_beams}\")\n",
- "print(f\"Encoder time: {percentile_print(encoder_pytorch_time)}\")\n",
- "print(f\"Decoder time: {percentile_print(decoder_pytorch_time)}\")\n",
- "print(f\"Full E2E time: {percentile_print(full_pytorch_time)}\")\n",
- "print(f\"Precision: FP16, Number of Beams: {num_beams}\")\n",
- "print(f\"Encoder time: {percentile_print(encoder_pytorch_time_fp16)}\")\n",
- "print(f\"Decoder time: {percentile_print(decoder_pytorch_time_fp16)}\")\n",
- "print(f\"Full E2E time: {percentile_print(full_pytorch_time_fp16)}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d662701-e430-4fdc-ad46-1f296defcf8f",
- "metadata": {
- "tags": []
- },
- "source": [
- "\n",
- "\n",
- "## 3. TensorRT Engine Building\n",
- "\n",
- "### Convert PyTorch to ONNX\n",
- "\n",
- "Prior to converting the model to a TensorRT engine, we will first convert the PyTorch model to an intermediate universal format.\n",
- "\n",
- "ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format.\n",
- "\n",
- "The steps to convert a PyTorch model to TensorRT are as follows:\n",
- "- Convert the pretrained PyTorch model into ONNX.\n",
- "- Import the ONNX model into TensorRT, apply optimizations and generate a TensorRT engine.\n",
- "- Perform inference on the GPU using the engine. \n",
- "\n",
- "For the BART model, we will convert the encoder and decoder to ONNX and build each engine seperately. The logistics of this separate building approach come from the nature of sequence-to-sequence models. BART and T5 are good examples of sequence-to-sequence models which use encoder-decoder architecture. The encoder is only executed once on the input and generates hidden states. Next, the decoder is executed repeatedly in an auto-regressive manner until the entire output finishes generating, i.e. the output sequence length is the number of times the decoder runs. The most efficient way to run encoder-decoder models with TensorRT is to have two separate engines."
- ]
- },
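- {
- "cell_type": "markdown",
- "id": "separate-engine-sketch-md",
- "metadata": {},
- "source": [
- "To make the reasoning above concrete, the following simplified greedy-decoding sketch shows why the encoder runs once while the decoder runs once per generated token. It is an illustration only; the actual demo uses the helper classes and HuggingFace generation utilities shown later in this notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "separate-engine-sketch-code",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Simplified greedy decoding sketch (illustration only, not the demo's implementation).\n",
- "# `encoder` and `decoder` stand for any callables, e.g. the two TensorRT engines built below.\n",
- "def greedy_generate_sketch(encoder, decoder, input_ids, start_token_id, eos_token_id, max_output_len):\n",
- "    encoder_hidden_states = encoder(input_ids)  # the encoder runs exactly once per input\n",
- "    output_ids = [start_token_id]\n",
- "    for _ in range(max_output_len):  # the decoder runs once per generated token\n",
- "        logits = decoder(output_ids, encoder_hidden_states)\n",
- "        next_token = int(logits[-1].argmax())  # pick the most likely next token (greedy)\n",
- "        output_ids.append(next_token)\n",
- "        if next_token == eos_token_id:\n",
- "            break\n",
- "    return output_ids"
- ]
- },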
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4ea48be5-1dae-4e93-92a4-840d7017ad9b",
- "metadata": {},
- "outputs": [],
- "source": [
- "onnx_model_path = './models/{}/onnx'.format(BART_VARIANT)\n",
- "!mkdir -p $onnx_model_path\n",
- "\n",
- "# FP32\n",
- "bart_model.float()\n",
- "metadata = NetworkMetadata(variant=BART_VARIANT, precision=Precision(fp16=False), other=BARTMetadata(kv_cache=TRT_KV))\n",
- "trt_config = BARTModelTRTConfig()\n",
- "metadata_string = trt_config.get_metadata_string(metadata)\n",
- "\n",
- "encoder_onnx_model_fpath = metadata_string + \"-encoder.onnx\"\n",
- "decoder_onnx_model_fpath = metadata_string + \"-decoder-with-lm-head.onnx\"\n",
- "\n",
- "# for onnx conversion, ensure model is on CPU and FP32 precision in this step\n",
- "bart_torchfile_encoder = BARTEncoderTorchFile(bart_model.to('cpu'), metadata)\n",
- "bart_torchfile_decoder = BARTDecoderTorchFile(bart_model.to('cpu'), metadata)\n",
- "\n",
- "onnx_bart_encoder = bart_torchfile_encoder.as_onnx_model(os.path.join(onnx_model_path, encoder_onnx_model_fpath), force_overwrite=False)\n",
- "onnx_bart_decoder = bart_torchfile_decoder.as_onnx_model(os.path.join(onnx_model_path, decoder_onnx_model_fpath), force_overwrite=False)\n",
- "\n",
- "# FP16\n",
- "metadata_fp16 = NetworkMetadata(variant=BART_VARIANT, precision=Precision(fp16=True), other=BARTMetadata(kv_cache=TRT_KV))\n",
- "trt_config_fp16 = BARTModelTRTConfig()\n",
- "metadata_string_fp16 = trt_config.get_metadata_string(metadata_fp16)\n",
- "\n",
- "encoder_onnx_model_fpath_fp16 = metadata_string_fp16 + \"-encoder.onnx\"\n",
- "decoder_onnx_model_fpath_fp16 = metadata_string_fp16 + \"-decoder-with-lm-head.onnx\"\n",
- "\n",
- "# for onnx conversion, ensure model is on CPU and FP32 precision in this step\n",
- "bart_torchfile_encoder = BARTEncoderTorchFile(bart_model.to('cpu'), metadata)\n",
- "bart_torchfile_decoder = BARTDecoderTorchFile(bart_model.to('cpu'), metadata)\n",
- "\n",
- "onnx_bart_encoder_fp16 = bart_torchfile_encoder.as_onnx_model(os.path.join(onnx_model_path, encoder_onnx_model_fpath_fp16), force_overwrite=False)\n",
- "onnx_bart_decoder_fp16 = bart_torchfile_decoder.as_onnx_model(os.path.join(onnx_model_path, decoder_onnx_model_fpath_fp16), force_overwrite=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7baf007e-5508-485c-a87f-9bfe16260452",
- "metadata": {},
- "source": [
- "### Convert ONNX to TensorRT\n",
- "\n",
- "Now we are ready to parse the ONNX encoder and decoder models and convert them to optimized TensorRT engines.\n",
- "\n",
- "Since the models contains dynamic input shapes, we can specify a valid input range with a TensorRT optimization profile."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6bd6e3fc-6797-46b0-a211-ce42d3769105",
- "metadata": {},
- "outputs": [],
- "source": [
- "tensorrt_model_path = './models/{}/tensorrt'.format(BART_VARIANT)\n",
- "!mkdir -p $tensorrt_model_path\n",
- "\n",
- "# Encoder optimization profiles\n",
- "encoder_profile = Profile()\n",
- "encoder_profile.add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, max_input_len // 2),\n",
- " max=(batch_size, max_input_len),\n",
- ")\n",
- "\n",
- "# Decoder optimization profiles\n",
- "decoder_profile = Profile()\n",
- "decoder_profile.add(\n",
- " \"input_ids\",\n",
- " min=(batch_size * num_beams, 1),\n",
- " opt=(batch_size * num_beams, max_output_len // 2),\n",
- " max=(batch_size * num_beams, max_output_len),\n",
- ")\n",
- "decoder_profile.add(\n",
- " \"encoder_hidden_states\",\n",
- " min=(batch_size * num_beams, 1, encoder_hidden_size),\n",
- " opt=(batch_size * num_beams, max_input_len // 2, encoder_hidden_size),\n",
- " max=(batch_size * num_beams, max_input_len, encoder_hidden_size),\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "aa5738ff-790e-47a0-ba03-27af87742646",
- "metadata": {},
- "outputs": [],
- "source": [
- "engine_tag = f\"bs{batch_size}\"\n",
- "\n",
- "if num_beams > 1:\n",
- " engine_tag += \"-beam{}\".format(num_beams)\n",
- "\n",
- "preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-noPreviewFasterDynamicShapes\"\n",
- "else:\n",
- " preview_features.append(PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)\n",
- "\n",
- "# FP32\n",
- "encoder_engine_name = os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + f\"-{engine_tag}.engine\".replace(f\"-beam{num_beams}\", \"\") # encoder engine not affected by beam search\n",
- "decoder_engine_name = os.path.join(tensorrt_model_path, decoder_onnx_model_fpath) + f\"-{engine_tag}.engine\"\n",
- "\n",
- "if not os.path.exists(encoder_engine_name):\n",
- " bart_trt_encoder_engine = BARTEncoderONNXFile(os.path.join(onnx_model_path, encoder_onnx_model_fpath), metadata).as_trt_engine(\n",
- " encoder_engine_name, \n",
- " profiles=[encoder_profile], \n",
- " preview_features=preview_features\n",
- " )\n",
- "else:\n",
- " bart_trt_encoder_engine = BARTEncoderTRTEngine(encoder_engine_name, metadata)\n",
- " \n",
- "if not os.path.exists(decoder_engine_name):\n",
- " bart_trt_decoder_engine = BARTDecoderONNXFile(os.path.join(onnx_model_path, decoder_onnx_model_fpath), metadata).as_trt_engine(\n",
- " decoder_engine_name, \n",
- " profiles=[decoder_profile], \n",
- " preview_features=preview_features\n",
- " )\n",
- "else:\n",
- " bart_trt_decoder_engine = BARTDecoderTRTEngine(decoder_engine_name, metadata)\n",
- "\n",
- "# FP16\n",
- "encoder_engine_name_fp16 = os.path.join(tensorrt_model_path, encoder_onnx_model_fpath_fp16) + f\"-{engine_tag}.engine\".replace(f\"-beam{num_beams}\", \"\") # encoder engine not affected by beam search\n",
- "decoder_engine_name_fp16 = os.path.join(tensorrt_model_path, decoder_onnx_model_fpath_fp16) + f\"-{engine_tag}.engine\"\n",
- "\n",
- "if not os.path.exists(encoder_engine_name_fp16):\n",
- " bart_trt_encoder_engine_fp16 = BARTEncoderONNXFile(os.path.join(onnx_model_path, encoder_onnx_model_fpath_fp16), metadata_fp16).as_trt_engine(\n",
- " encoder_engine_name_fp16, \n",
- " profiles=[encoder_profile], \n",
- " preview_features=preview_features\n",
- " )\n",
- "else:\n",
- " bart_trt_encoder_engine_fp16 = BARTEncoderTRTEngine(encoder_engine_name_fp16, metadata_fp16)\n",
- " \n",
- "if not os.path.exists(decoder_engine_name_fp16):\n",
- " bart_trt_decoder_engine_fp16 = BARTDecoderONNXFile(os.path.join(onnx_model_path, decoder_onnx_model_fpath_fp16), metadata_fp16).as_trt_engine(\n",
- " decoder_engine_name_fp16, \n",
- " profiles=[decoder_profile], \n",
- " preview_features=preview_features\n",
- " )\n",
- "else:\n",
- " bart_trt_decoder_engine_fp16 = BARTDecoderTRTEngine(decoder_engine_name_fp16, metadata_fp16)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "74f7f6fc-1e6a-4ddc-8e9b-543d9e8dab4d",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 4. TensorRT Inference\n",
- "\n",
- "Great, if you have reached this stage, it means we now have successfully built optimized TensorRT engines for the BART model, ready for us to carry out inference. The BART model with TensorRT backend can now be employed in place of the original HuggingFace BART model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3954f2f4-c393-463b-a44b-3e5335032b57",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Initialize TensorRT engines\n",
- "trt_config = AutoConfig.from_pretrained(BART_VARIANT, use_cache = metadata.other.kv_cache)\n",
- "\n",
- "# FP32\n",
- "bart_trt_encoder = BARTTRTEncoder(bart_trt_encoder_engine, metadata, trt_config, batch_size=batch_size)\n",
- "bart_trt_decoder = BARTTRTDecoder(bart_trt_decoder_engine, metadata, trt_config, batch_size=batch_size, num_beams=num_beams)\n",
- "\n",
- "# FP16\n",
- "bart_trt_encoder_fp16 = BARTTRTEncoder(bart_trt_encoder_engine_fp16, metadata_fp16, trt_config, batch_size=batch_size)\n",
- "bart_trt_decoder_fp16 = BARTTRTDecoder(bart_trt_decoder_engine_fp16, metadata_fp16, trt_config, batch_size=batch_size, num_beams=num_beams)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f7025246-4f14-4449-bb93-6c1566f48773",
- "metadata": {},
- "source": [
- "### End-to-End TensorRT Inference"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "92a5bbfe-a576-4a94-99d1-f0862b31fdb4",
- "metadata": {},
- "outputs": [],
- "source": [
- "from transformers.generation_logits_process import (\n",
- " NoRepeatNGramLogitsProcessor,\n",
- " MinLengthLogitsProcessor,\n",
- " ForcedBOSTokenLogitsProcessor,\n",
- " ForcedEOSTokenLogitsProcessor,\n",
- " LogitsProcessorList,\n",
- ")\n",
- "from transformers.generation_stopping_criteria import (\n",
- " MaxLengthCriteria,\n",
- " StoppingCriteriaList,\n",
- ")\n",
- "from transformers.generation_beam_search import (\n",
- " BeamSearchScorer,\n",
- ")\n",
- "\n",
- "stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_output_len)])\n",
- "no_repeat_ngram_size = BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE\n",
- "min_length = BARTModelTRTConfig.MIN_OUTPUT_LENGTH[BART_VARIANT]\n",
- "logits_processor = LogitsProcessorList([\n",
- " NoRepeatNGramLogitsProcessor(no_repeat_ngram_size), \n",
- " MinLengthLogitsProcessor(min_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token)),\n",
- " ForcedBOSTokenLogitsProcessor(tokenizer.convert_tokens_to_ids(tokenizer.bos_token)),\n",
- " ForcedEOSTokenLogitsProcessor(max_output_len, tokenizer.convert_tokens_to_ids(tokenizer.eos_token))\n",
- "]) # by checking HuggingFace's generate() implementation carefully, the default logits processor for BART has no_repeat_ngram_size = 3 and forced_eos_token_id = 2. In this way we can ensure identical results with raw HuggingFace\n",
- "\n",
- "decoder_initial_input = torch.full(\n",
- " (batch_size, 1), tokenizer.convert_tokens_to_ids(tokenizer.eos_token), dtype=torch.int32\n",
- ").to('cuda')\n",
- "\n",
- "if num_beams > 1:\n",
- " decoder_initial_input = expand_inputs_for_beam_search(decoder_initial_input, expand_size=num_beams)\n",
- " \n",
- "# FP32\n",
- "def e2e_trt():\n",
- " with torch.no_grad():\n",
- " encoder_last_hidden_states = bart_trt_encoder(input_ids=input_ids)\n",
- " \n",
- " if num_beams > 1:\n",
- " # prepare input for beam search\n",
- " encoder_last_hidden_states = expand_inputs_for_beam_search(encoder_last_hidden_states, expand_size=num_beams)\n",
- "\n",
- " # beam scorer must be reset before each beam search run, otherwise beam search will be skipped due to scorer cache\n",
- " beam_scorer = BeamSearchScorer(\n",
- " batch_size=batch_size,\n",
- " num_beams=num_beams,\n",
- " device=\"cuda\",\n",
- " do_early_stopping=True,\n",
- " )\n",
- " \n",
- " bart_trt_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_states)\n",
- " \n",
- " if num_beams == 1:\n",
- " decoder_output = bart_trt_decoder.greedy_search(\n",
- " input_ids=decoder_initial_input,\n",
- " encoder_hidden_states=encoder_last_hidden_states,\n",
- " stopping_criteria=stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " use_cuda=True\n",
- " )\n",
- " else:\n",
- " decoder_output = bart_trt_decoder.beam_search(\n",
- " input_ids=decoder_initial_input,\n",
- " beam_scorer=beam_scorer,\n",
- " encoder_hidden_states=encoder_last_hidden_states,\n",
- " stopping_criteria=stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " use_cuda=True\n",
- " )\n",
- " return decoder_output\n",
- "\n",
- "output_ids = e2e_trt()\n",
- "outputs_trt = tokenizer.decode(output_ids[0], skip_special_tokens=True)\n",
- "trt_time = measure_python_inference_code(e2e_trt, timing_profile)\n",
- "\n",
- "# FP16\n",
- "def e2e_trt_fp16():\n",
- " with torch.no_grad():\n",
- " encoder_last_hidden_states = bart_trt_encoder_fp16(input_ids=input_ids)\n",
- " \n",
- " if num_beams > 1:\n",
- " # prepare input for beam search\n",
- " encoder_last_hidden_states = expand_inputs_for_beam_search(encoder_last_hidden_states, expand_size=num_beams)\n",
- " \n",
- " # beam scorer must be reset before each beam search run, otherwise beam search will be skipped due to scorer cache\n",
- " beam_scorer = BeamSearchScorer(\n",
- " batch_size=batch_size,\n",
- " num_beams=num_beams,\n",
- " device=\"cuda\",\n",
- " do_early_stopping=True,\n",
- " )\n",
- " \n",
- " bart_trt_decoder_fp16.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_states)\n",
- " \n",
- " if num_beams == 1:\n",
- " decoder_output = bart_trt_decoder_fp16.greedy_search(\n",
- " input_ids=decoder_initial_input,\n",
- " encoder_hidden_states=encoder_last_hidden_states,\n",
- " stopping_criteria=stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " use_cuda=True\n",
- " )\n",
- " else:\n",
- " decoder_output = bart_trt_decoder_fp16.beam_search(\n",
- " input_ids=decoder_initial_input,\n",
- " beam_scorer=beam_scorer,\n",
- " encoder_hidden_states=encoder_last_hidden_states,\n",
- " stopping_criteria=stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " use_cuda=True\n",
- " )\n",
- " return decoder_output\n",
- "\n",
- "output_ids_fp16 = e2e_trt_fp16()\n",
- "outputs_trt_fp16 = tokenizer.decode(output_ids_fp16[0], skip_special_tokens=True)\n",
- "trt_time_fp16 = measure_python_inference_code(e2e_trt_fp16, timing_profile)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6198afcf-70d1-46ef-a515-dcf5ea4c17b6",
- "metadata": {},
- "outputs": [],
- "source": [
- "# print results and timing statistics\n",
- "print(f'Device: {torch.cuda.get_device_name()}')\n",
- "print(f\"Using engine: {metadata_string + '-' + engine_tag}\") \n",
- "print(f'Output identical to HF results? {outputs_trt == outputs_hf}')\n",
- "print(f\"Precision: FP32\")\n",
- "print(f'TRT time: {percentile_print(trt_time)}')\n",
- "print()\n",
- "print(f\"Using engine: {metadata_string_fp16 + '-' + engine_tag}\") \n",
- "print(f'Output identical to HF results? {outputs_trt_fp16 == outputs_hf}')\n",
- "print(f\"Precision: FP16\")\n",
- "print(f'TRT time: {percentile_print(trt_time_fp16)}')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ed9d4a98-b034-470e-a9f8-096d4100b8d4",
- "metadata": {},
- "source": [
- "### Time Measurement of Encoder, Decoder, and Full E2E\n",
- "We will benchmark the encoder, decoder, and full end-to-end as we did for HuggingFace before."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "2320e4bf-94f2-40d8-9a86-3a1ea352fca2",
- "metadata": {},
- "outputs": [],
- "source": [
- "# FP32\n",
- "encoder_last_hidden_states, encoder_trt_time = encoder_inference(bart_trt_encoder, input_ids, timing_profile)\n",
- "_, decoder_trt_time = decoder_inference(bart_trt_decoder, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, expand_inputs_for_beam_search(encoder_last_hidden_states, num_beams) if num_beams > 1 else encoder_last_hidden_states, timing_profile)\n",
- "\n",
- "if num_beams == 1:\n",
- " _, full_trt_time = full_inference_greedy(\n",
- " bart_trt_encoder,\n",
- " bart_trt_decoder,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " timing_profile,\n",
- " max_length=max_output_len,\n",
- " min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant],\n",
- " batch_size=batch_size,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " )\n",
- "else:\n",
- " _, full_trt_time = full_inference_beam(\n",
- " bart_trt_encoder,\n",
- " bart_trt_decoder,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " timing_profile,\n",
- " num_beams=num_beams,\n",
- " max_length=max_output_len,\n",
- " min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant],\n",
- " batch_size=batch_size,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " early_stopping=True,\n",
- " )\n",
- " \n",
- "print(f'Encoder time: {percentile_print(encoder_trt_time)}')\n",
- "print(f'Decoder time: {percentile_print(decoder_trt_time)}')\n",
- "print(f'Full E2E time: {percentile_print(full_trt_time)}')\n",
- "\n",
- "# FP16\n",
- "encoder_last_hidden_states, encoder_trt_time_fp16 = encoder_inference(bart_trt_encoder_fp16, input_ids, timing_profile)\n",
- "_, decoder_trt_time_fp16 = decoder_inference(bart_trt_decoder_fp16, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, expand_inputs_for_beam_search(encoder_last_hidden_states, num_beams) if num_beams > 1 else encoder_last_hidden_states, timing_profile)\n",
- "\n",
- "if num_beams == 1:\n",
- " _, full_trt_time_fp16 = full_inference_greedy(\n",
- " bart_trt_encoder_fp16,\n",
- " bart_trt_decoder_fp16,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " timing_profile,\n",
- " max_length=max_output_len,\n",
- " min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant],\n",
- " batch_size=batch_size,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " )\n",
- "else:\n",
- " _, full_trt_time_fp16 = full_inference_beam(\n",
- " bart_trt_encoder_fp16,\n",
- " bart_trt_decoder_fp16,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " timing_profile,\n",
- " num_beams=num_beams,\n",
- " max_length=max_output_len,\n",
- " min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant],\n",
- " batch_size=batch_size,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " early_stopping=True,\n",
- " )\n",
- "print(f'Encoder FP16 time: {percentile_print(encoder_trt_time_fp16)}')\n",
- "print(f'Decoder FP16 time: {percentile_print(decoder_trt_time_fp16)}')\n",
- "print(f'Full E2E FP16 time: {percentile_print(full_trt_time_fp16)}')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e27cda12-7e56-4a87-935d-ce598557cf26",
- "metadata": {},
- "source": [
- "## Comparison"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "1090b46c-adec-4684-8c53-a54a196dedb1",
- "metadata": {},
- "outputs": [],
- "source": [
- "from tabulate import tabulate\n",
- "\n",
- "data = [\n",
- " ['Framework', 'Precision', 'Encoder p50 (ms)', 'Decoder p50 (ms)', 'Full E2E p50 (ms)', 'Accuracy'],\n",
- " ['HuggingFace (w/o cache)', 'FP32', '-', '-', f'{hf_nonkv_time[0]*1000:.2f}', '-'],\n",
- " ['HuggingFace (w/ cache)', 'FP32', '-', '-', f'{hf_kv_time[0]*1000:.2f}', '-'],\n",
- " ['HuggingFace (w/o cache)', 'FP16', '-', '-', f'{hf_nonkv_time_fp16[0]*1000:.2f}', '-'],\n",
- " ['HuggingFace (w/ cache)', 'FP16', '-', '-', f'{hf_kv_time_fp16[0]*1000:.2f}', '-'],\n",
- " ['PyTorch', 'FP32', f'{encoder_pytorch_time[0]*1000:.2f}', f'{decoder_pytorch_time[0]*1000:.2f}', f'{full_pytorch_time[0]*1000:.2f}', outputs_pytorch == outputs_hf],\n",
- " ['PyTorch', 'FP16', f'{encoder_pytorch_time_fp16[0]*1000:.2f}', f'{decoder_pytorch_time_fp16[0]*1000:.2f}', f'{full_pytorch_time_fp16[0]*1000:.2f}', outputs_pytorch_fp16 == outputs_hf],\n",
- " ['TensorRT', 'FP32', f'{encoder_trt_time[0]*1000:.2f}', f'{decoder_trt_time[0]*1000:.2f}', f'{full_trt_time[0]*1000:.2f}', outputs_trt == outputs_hf],\n",
- " ['TensorRT', 'FP16', f'{encoder_trt_time_fp16[0]*1000:.2f}', f'{decoder_trt_time_fp16[0]*1000:.2f}', f'{full_trt_time_fp16[0]*1000:.2f}', outputs_trt_fp16 == outputs_hf],\n",
- "]\n",
- "\n",
- "print(tabulate(data, headers='firstrow', tablefmt='github'))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "92031643-8ee8-4d50-864b-a08e4d551dc6",
- "metadata": {},
- "source": [
- "We can now compare the original HuggingFace model and the TensorRT engine, from both separate encoder/decoder and end-to-end speed difference. For bart-base variant on an NVIDIA Titan V GPU and input/output sequence length around 130, this results in about 2x performance improvement with FP16 inference."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9a498672-ba25-42b0-b89e-79e0b869943a",
- "metadata": {},
- "source": [
- "## Variable Input/Output Length\n",
- "\n",
- "We can run more tests by varying input/output length, while using the same engines.\n",
- "\n",
- "Note that TensorRT performance depends on optimal selection of the kernels in the engine. The variable length test here uses the same engine built with max input/output length profile, therefore may not represent the best perf. If the use case has known input/output length ranges, it is highly recommended to specify in the TensorRT engine profiles to ensure optimized kernel selection."
- ]
- },
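- {
- "cell_type": "markdown",
- "id": "tight-profile-sketch-md",
- "metadata": {},
- "source": [
- "As a sketch of that recommendation: if inputs were known to stay between, say, 256 and 512 tokens for a summarization workload, the encoder profile could be tightened accordingly before rebuilding the engine. The numbers below are illustrative only; the engines used in this notebook were built with the wider profiles defined earlier."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "tight-profile-sketch-code",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Illustrative only: a tighter encoder profile for a hypothetical 256-512 token input range.\n",
- "tight_encoder_profile = Profile()\n",
- "tight_encoder_profile.add(\n",
- "    \"input_ids\",\n",
- "    min=(batch_size, 256),\n",
- "    opt=(batch_size, 384),\n",
- "    max=(batch_size, 512),\n",
- ")\n",
- "# Engines rebuilt with such a profile let TensorRT pick kernels tuned for this narrower range."
- ]
- },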
- {
- "cell_type": "markdown",
- "id": "36f25217-be31-45bf-8652-0e18162fa360",
- "metadata": {},
- "source": [
- "### Single example"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "985d8a01-e5b7-449e-9e43-7c8315a2578d",
- "metadata": {},
- "outputs": [],
- "source": [
- "# ensure HF model are on GPU for testing (cells above moved it CPU). For cuda 11.4, disable this block\n",
- "if not cuda_114_mode:\n",
- " bart_model = bart_model.to('cuda').eval()\n",
- "\n",
- " in_len, out_len = 24, 24\n",
- "\n",
- " data = [\n",
- " ['(input_len, output_len)', 'HF FP32 p50 (s)', 'HF FP16 p50 (s)', 'TRT FP32 p50 (s)', 'TRT FP16 p50 (s)'],\n",
- " ]\n",
- "\n",
- " assert in_len <= max_input_len and out_len <= max_output_len\n",
- "\n",
- " in_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[BART_VARIANT], (batch_size, in_len)).to('cuda')\n",
- "\n",
- " # HF\n",
- " bart_model.float()\n",
- " hf_32 = measure_python_inference_code(lambda: bart_model.generate(in_ids, min_length=out_len, max_length=out_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- " bart_model.half()\n",
- " hf_16 = measure_python_inference_code(lambda: bart_model.generate(in_ids, min_length=out_len, max_length=out_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- "\n",
- " # TRT\n",
- " if num_beams == 1:\n",
- " _, trt_32 = full_inference_greedy(bart_trt_encoder, bart_trt_decoder, in_ids, tokenizer, timing_profile, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, use_cuda=True,)\n",
- " _, trt_16 = full_inference_greedy(bart_trt_encoder_fp16, bart_trt_decoder_fp16, in_ids, tokenizer, timing_profile, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, use_cuda=True,)\n",
- " else:\n",
- " _, trt_32 = full_inference_beam(bart_trt_encoder, bart_trt_decoder, in_ids, tokenizer, timing_profile, num_beams=num_beams, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, early_stopping=True,)\n",
- " _, trt_16 = full_inference_beam(bart_trt_encoder_fp16, bart_trt_decoder_fp16, in_ids, tokenizer, timing_profile, num_beams=num_beams, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, early_stopping=True,)\n",
- "\n",
- " data.append([(in_len, out_len), hf_32[0], hf_16[0], trt_32[0], trt_16[0]])\n",
- "\n",
- " print(tabulate(data, headers='firstrow', tablefmt='github'))\n",
- "else:\n",
- " print(\"CUDA 11.4 is currently incompatible with GPU models, skipping\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8edf4f5c-49a0-4509-a4d7-8b561dba3f88",
- "metadata": {},
- "source": [
- "### Several representative examples"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "9e335010-ff7f-4822-85ae-bca8d235de1b",
- "metadata": {},
- "outputs": [],
- "source": [
- "# ensure HF model are on GPU for testing (cells above moved it CPU). For cuda 11.4, disable this block\n",
- "if not cuda_114_mode:\n",
- " bart_model = bart_model.to('cuda').eval()\n",
- "\n",
- " input_output_len_list = [\n",
- " (64, 128), # generation task\n",
- " (64, 512),\n",
- " (512, 64), # summarization task\n",
- " (128, 64),\n",
- " (32, 32), # translation task\n",
- " (128, 128),\n",
- " (512, 512),\n",
- " ]\n",
- "\n",
- " data = [\n",
- " ['(input_len, output_len)', 'HF FP32 p50 (s)', 'HF FP16 p50 (s)', 'TRT FP32 p50 (s)', 'TRT FP16 p50 (s)'],\n",
- " ]\n",
- "\n",
- " for (in_len, out_len) in input_output_len_list:\n",
- " assert in_len <= max_input_len and out_len <= max_output_len\n",
- "\n",
- " in_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[BART_VARIANT], (batch_size, in_len)).to('cuda')\n",
- "\n",
- " # HF\n",
- " bart_model.float()\n",
- " hf_32 = measure_python_inference_code(lambda: bart_model.generate(in_ids, min_length=out_len, max_length=out_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- " bart_model.half()\n",
- " hf_16 = measure_python_inference_code(lambda: bart_model.generate(in_ids, min_length=out_len, max_length=out_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- "\n",
- " # TRT\n",
- " if num_beams == 1:\n",
- " _, trt_32 = full_inference_greedy(bart_trt_encoder, bart_trt_decoder, in_ids, tokenizer, timing_profile, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, use_cuda=True,)\n",
- " _, trt_16 = full_inference_greedy(bart_trt_encoder_fp16, bart_trt_decoder_fp16, in_ids, tokenizer, timing_profile, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, use_cuda=True,)\n",
- " else:\n",
- " _, trt_32 = full_inference_beam(bart_trt_encoder, bart_trt_decoder, in_ids, tokenizer, timing_profile, num_beams=num_beams, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, early_stopping=True,)\n",
- " _, trt_16 = full_inference_beam(bart_trt_encoder_fp16, bart_trt_decoder_fp16, in_ids, tokenizer, timing_profile, num_beams=num_beams, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, early_stopping=True,)\n",
- "\n",
- " data.append([(in_len, out_len), hf_32[0], hf_16[0], trt_32[0], trt_16[0]])\n",
- "\n",
- " print(tabulate(data, headers='firstrow', tablefmt='github'))\n",
- "else:\n",
- " print(\"CUDA 11.4 is currently incompatible with GPU models, skipping\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a598a0ae-2e21-4898-ae56-8429a5d00760",
- "metadata": {},
- "source": [
- "It shows around 2x speedup comparing to HuggingFace's KV-cache optimized timing, for relatively short output sequence length. For long output sequence length, due to memory copies overhead between the decoding steps, TensorRT may not provide significant speedup at the current stage."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2a1f5dca-397c-4c8c-9200-61b30cdba824",
- "metadata": {},
- "source": [
- "## Conclusion\n",
- "\n",
- "This notebook has walked you through the process of converting a HuggingFace PyTorch BART model to an optimized TensorRT engine for inference in easy steps. The TensorRT inference engine can be conviniently used as a drop-in replacement for the orginial HuggingFace BART model while providing speed up. \n",
- "\n",
- "If you are interested in further details of the conversion process, check out [BART/trt.py](../BART/trt.py)"
- ]
- }
- ],
- "metadata": {
- "interpreter": {
- "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
- },
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/gpt2-playground.ipynb b/demo/HuggingFace/notebooks/gpt2-playground.ipynb
deleted file mode 100644
index 76c92d19..00000000
--- a/demo/HuggingFace/notebooks/gpt2-playground.ipynb
+++ /dev/null
@@ -1,243 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "64974d33-d028-440c-86fa-1a0633b3d31d",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c3f0ff46-9958-4d57-9067-a64be34e75da",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# GPT-2 Playground\n",
- "\n",
- "This notebook demonstrates the GPT-2 model for open-end text generation.\n",
- "\n",
- "The TensorRT HuggingFace GPT-2 model is a plug-in replacement for the original PyTorch HuggingFace GPT-2 model.\n",
- "\n",
- "\n",
- "**Notes**: \n",
- " - For \"CPU - PyTorch\" and \"GPU - PyTorch\", a GPT-2 small model from HuggingFace model repository is employed. Inference is carried out with PyTorch in FP32 precision. All models run with batch size 1.\n",
- "Average run time across 5 runs is reported.\n",
- " - Prior to running this notebook, run [gpt2.ipynb](gpt2.ipynb) to download the GPT-2 model and generate the TensorRT engine."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3530e767-7050-4329-a4bc-e2221b9eb578",
- "metadata": {
- "jupyter": {
- "source_hidden": true
- },
- "tags": []
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')\n",
- "\n",
- "import torch \n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " GPT2LMHeadModel,\n",
- " GPT2Tokenizer,\n",
- " GPT2Config,\n",
- ")\n",
- "\n",
- "from GPT2.trt import GPT2TRTDecoder, GPT2TRTEngine\n",
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "from collections import namedtuple \n",
- "from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig, GPT2Metadata\n",
- "\n",
- "# download HuggingFace model and tokernizer\n",
- "GPT2_VARIANT = 'gpt2' # choices: gpt2 | gpt2-large\n",
- "model = GPT2LMHeadModel.from_pretrained(GPT2_VARIANT)\n",
- "config = GPT2Config(GPT2_VARIANT)\n",
- "tokenizer = GPT2Tokenizer.from_pretrained(GPT2_VARIANT)\n",
- "\n",
- "# load TensorRT engine\n",
- "metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=False), other=GPT2Metadata(kv_cache=False))\n",
- "from os.path import exists\n",
- "if not exists('./models/gpt2/trt-engine/gpt2.onnx.engine'):\n",
- " print(\"Error: TensorRT engine not found at ./models/gpt2/trt-engine/gpt2.onnx.engine. Please run gpt2.ipynb to generate the TensorRT engine first!\")\n",
- "else:\n",
- " gpt2_engine = GPT2TRTEngine('./models/gpt2/trt-engine/gpt2.onnx.engine', metadata)\n",
- " gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "766b8c94-ba8e-47c8-8624-57da462a0496",
- "metadata": {
- "jupyter": {
- "source_hidden": true
- },
- "tags": []
- },
- "outputs": [],
- "source": [
- "import ipywidgets as widgets\n",
- "import numpy as np\n",
- "import time\n",
- "\n",
- "device = widgets.RadioButtons(\n",
- " options=['CPU - PyTorch', \n",
- " 'GPU - PyTorch', \n",
- " 'GPU - TensorRT'],\n",
- " description='Device:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "paragraph_text = widgets.Textarea(\n",
- " value='TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps '\\\n",
- "'such as recommenders, speech and image/video on NVIDIA GPUs. ',\n",
- " placeholder='Type something',\n",
- " description='Context:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5, \n",
- ")\n",
- "\n",
- "generated_text = widgets.Textarea(\n",
- " value='...',\n",
- " placeholder='GPT-2 generated text',\n",
- " description='GPT-2:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5,\n",
- ")\n",
- "button = widgets.Button(description=\"Generate\")\n",
- "\n",
- "display(paragraph_text)\n",
- "display(generated_text)\n",
- "display(device)\n",
- "\n",
- "from IPython.display import display\n",
- "box_layout = widgets.Layout(display='flex',\n",
- " flex_flow='column',\n",
- " align_items='center',\n",
- " width='100%')\n",
- "N_RUN = 6\n",
- "progress_bar = widgets.IntProgress(\n",
- " value=0,\n",
- " min=0,\n",
- " max=N_RUN,\n",
- " description='Progress:',\n",
- " bar_style='', # 'success', 'info', 'warning', 'danger' or ''\n",
- " style={'bar_color': 'green'},\n",
- " orientation='horizontal', \n",
- " layout=widgets.Layout(width='100%', height='50px')\n",
- ")\n",
- "\n",
- "box = widgets.HBox(children=[button],layout=box_layout)\n",
- "output = widgets.Output()\n",
- "display(box)\n",
- "display(progress_bar)\n",
- "display(output)\n",
- "\n",
- "def generate(b):\n",
- " progress_bar.value = 0\n",
- " inference_time_arr = []\n",
- " with output:\n",
- " if device.value == 'GPU - TensorRT':\n",
- " inputs = tokenizer(paragraph_text.value, return_tensors=\"pt\")\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " sample_output = gpt2_trt.generate(inputs.input_ids.to('cuda:0'), max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT])\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(sample_output[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"GPU - TensorRT - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- " elif device.value == 'CPU - PyTorch':\n",
- " inputs = tokenizer(paragraph_text.value, return_tensors=\"pt\")\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " sample_output = model.to('cpu').generate(inputs.input_ids.to('cpu'), max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT])\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(sample_output[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"CPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:])))\n",
- " \n",
- " elif device.value == 'GPU - PyTorch': \n",
- " inputs = tokenizer(paragraph_text.value, return_tensors=\"pt\")\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " sample_output = model.to('cuda:0').generate(inputs.input_ids.to('cuda:0'), max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT])\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(sample_output[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"GPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- "button.on_click(generate)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "58f473c0-6682-41af-8040-72f0a9472b0f",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.9"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/gpt2.ipynb b/demo/HuggingFace/notebooks/gpt2.ipynb
deleted file mode 100644
index 745b996b..00000000
--- a/demo/HuggingFace/notebooks/gpt2.ipynb
+++ /dev/null
@@ -1,1218 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "28e6e614-e360-4292-965e-0d255027e9b9",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9b88dc1a-a92d-44cc-9fb7-d9e2ef20c8e2",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# Accelerating HuggingFace GPT-2 Inference with TensorRT\n",
- "\n",
- "GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. The model was pretrained on the raw texts to guess the next word in sentences. As no human labeling was required, GPT-2 pretraining can use lots of publicly available data with an automatic process to generate inputs and labels from those data.\n",
- "\n",
- "This notebook shows 3 easy steps to convert a [HuggingFace PyTorch GPT-2 model](https://huggingface.co/gpt2) to a TensorRT engine for high-performance inference.\n",
- "\n",
- "1. [Download HuggingFace GPT-2 model ](#1)\n",
- "1. [Convert to ONNX format](#2)\n",
- "1. [Convert to TensorRT engine](#3)\n",
- "1. [Advanced Topic: KV Cache](#4)\n",
- "1. [Advanced Topic: Beam Search](#5)\n",
- "\n",
- "## Prerequisite\n",
- "\n",
- "Follow the instruction at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS docker container required to run this notebook.\n",
- "\n",
- "Next, we install some extra dependencies."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "79281ed9-4855-4ade-a810-a2899a5872b9",
- "metadata": {
- "custom": {
- "metadata": {
- "tags": [
- "skip-execution"
- ]
- }
- },
- "language": "python",
- "tags": []
- },
- "outputs": [],
- "source": [
- "%%capture\n",
- "!pip3 install -r ../requirements.txt"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d3e57ece",
- "metadata": {},
- "source": [
- "**Note:** After this step, you should restart the Jupyter kernel for the change to take effect."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "235d2f1b-439e-4cd0-8286-1d63a13f2cf3",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import torch \n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " GPT2LMHeadModel,\n",
- " GPT2Tokenizer,\n",
- " GPT2Config,\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "af4254e2-11fd-4bc7-ac0b-60b1a9e07c4e",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 1. Download HuggingFace GPT-2 model \n",
- "\n",
- "First, we download the original HuggingFace PyTorch GPT-2 model from HuggingFace model hubs, together with its associated tokernizer.\n",
- "\n",
- "The GPT-2 variants supported by TensorRT 8 are: gpt2 (117M), gpt2-medium (355M), gpt2-large (774M), gpt2-xl (1.5B)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fae66d58-f994-4987-8f1d-1fa8ac2ec8b4",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# download model and tokernizer\n",
- "GPT2_VARIANT = 'gpt2' # choices: gpt2 | gpt2-medium | gpt2-large | gpt2-xl\n",
- "config = GPT2Config(GPT2_VARIANT)\n",
- "\n",
- "model = GPT2LMHeadModel.from_pretrained(GPT2_VARIANT, force_download = False)\n",
- "tokenizer = GPT2Tokenizer.from_pretrained(GPT2_VARIANT)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7252ca90-1104-40dc-8e72-f51c07a4cd11",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# save model locally\n",
- "pytorch_model_dir = './models/{}/pytorch'.format(GPT2_VARIANT)\n",
- "!mkdir -p $pytorch_model_dir\n",
- "\n",
- "model.save_pretrained(pytorch_model_dir)\n",
- "print(\"Pytorch Model saved to {}\".format(pytorch_model_dir))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a84c5766-97ed-4d04-bab5-7fa18e89dee8",
- "metadata": {},
- "source": [
- "### Inference with PyTorch model"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e43067c2-ecd9-4bd6-9047-a3f74621931b",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# carry out inference with a single sample\n",
- "input_str = \"Hello, my dog is \"\n",
- "inputs = tokenizer(input_str, return_tensors=\"pt\")\n",
- "input_ids = inputs.input_ids"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6d347ddf-4504-4ab7-b15b-29d218bdd7a8",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "input_ids, input_ids.shape"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "cf83454f",
- "metadata": {},
- "outputs": [],
- "source": [
- "# WAR: Using an ugly representation because cuda 11.4 does not support GPU models due to cublas errors\n",
- "if \"cuda-11.4\" in os.environ[\"LD_LIBRARY_PATH\"]:\n",
- " model = model.cpu()\n",
- " input_ids = input_ids.cpu()\n",
- " inputs = inputs.to('cpu')\n",
- "else:\n",
- " model = model.cuda()\n",
- " input_ids = input_ids.cuda()\n",
- " inputs = inputs.to('cuda:0')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "39d6c2ea-3450-4b8b-9cc8-09943d967ece",
- "metadata": {},
- "source": [
- "#### Single example inference"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b844f057-e768-467d-9185-68fb4c74b5ab",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "model.eval()\n",
- "with torch.no_grad():\n",
- " outputs = model(**inputs, labels=inputs['input_ids'], use_cache = False)\n",
- "\n",
- "logits = outputs.logits"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "717b2f68-9d92-474e-9937-8b42a1c60d14",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "logits, logits.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a6c0468b-976a-4a08-98d3-e87578ec067f",
- "metadata": {},
- "source": [
- "For benchmarking purposes, we will employ a helper function `gpt2_inference` which executes the inference on a single batch repeatedly and measures end to end execution time. Let's take note of this execution time for later comparison with TensorRT. \n",
- " \n",
- "`TimingProfile` is a named tuple that specifies the number of experiments and number of times to call the function per iteration (and number of warm-up calls although it is not used here)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ecdf8f00-0562-482b-9bec-b0b7596aec48",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from GPT2.measurements import gpt2_inference\n",
- "from NNDF.networks import TimingProfile\n",
- "\n",
- "# Benchmarking TensorRT performance on single batch\n",
- "_, decoder_e2e_median_time = gpt2_inference(\n",
- " model, input_ids, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- " )\n",
- "decoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4805756f-81f9-43cf-88f6-b205ecd23034",
- "metadata": {},
- "source": [
- "#### Open-end text generation\n",
- "Next, we will employ the PyTorch model for the open-end text generation task, which GPT-2 is particularly good at. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "1bb282bf-a8f4-47c4-830e-f2fb69d9d8d5",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig\n",
- "# MAX_LENGTH represents the maximum length that GPT2 could be used in text generation. \n",
- "# This corresponds to max_length in task_specific_params for text-generation, which = 50 for each model config.\n",
- "# If the length exceeds max_length, the output becomes meaningless for the specific task.\n",
- "max_length = GPT2ModelTRTConfig.MAX_LENGTH[GPT2_VARIANT]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8c3d01fc-9928-486b-9d15-de84d46528e5",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "sample_output = model.generate(input_ids, max_length=max_length, use_cache = False)\n",
- "\n",
- "# de-tokenize model output to raw text\n",
- "tokenizer.decode(sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0b016c2f-7982-44ac-81e5-d3854391a8b6",
- "metadata": {},
- "source": [
- "For benchmarking purposes, we will employ a helper function `full_inference` which executes the inference repeatedly and measures end to end execution time. Let's take note of this execution time for later comparison with TensorRT. \n",
- "\n",
- "TimingProfile is a named tuple that specifies the number of experiments and number of times to call the function per iteration (and number of warm-up calls although it is not used here)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "93aea249-529e-4b5e-9759-e0c8370391a3",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from GPT2.measurements import full_inference\n",
- "\n",
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " model, input_ids, tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d662701-e430-4fdc-ad46-1f296defcf8f",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 2. Convert to ONNX format\n",
- "\n",
- "Prior to converting the model to a TensorRT engine, we will first convert the PyTorch model to an intermediate universal format: ONNX.\n",
- "\n",
- "ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format.\n",
- "\n",
- "At a high level, the steps to convert a PyTorch model to TensorRT are as follows:\n",
- "- Convert the pretrained image segmentation PyTorch model into ONNX.\n",
- "- Import the ONNX model into TensorRT.\n",
- "- Apply optimizations and generate an engine.\n",
- "- Perform inference on the GPU with the TensorRT engine. "
- ]
- },
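- {
- "cell_type": "markdown",
- "id": "onnx-export-sketch-md",
- "metadata": {},
- "source": [
- "The next cells use the demo's `GPT2TorchFile` helper to perform this export. As a rough illustration of what such an export involves under the hood, a plain `torch.onnx.export` call might look like the sketch below; the file name, axis names, and opset version are illustrative and not the demo's exact settings."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "onnx-export-sketch-code",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Illustrative sketch of a direct ONNX export (the demo's GPT2TorchFile helper handles this for you).\n",
- "model.config.use_cache = False  # avoid exporting past key/value cache tensors in this sketch\n",
- "dummy_input = torch.randint(0, 50257, (1, 8))  # (batch, sequence) of token ids; 50257 is the GPT-2 vocab size\n",
- "torch.onnx.export(\n",
- "    model.to('cpu'),\n",
- "    (dummy_input,),\n",
- "    \"gpt2_sketch.onnx\",\n",
- "    input_names=[\"input_ids\"],\n",
- "    output_names=[\"logits\"],\n",
- "    dynamic_axes={\"input_ids\": {0: \"batch\", 1: \"sequence\"}, \"logits\": {0: \"batch\", 1: \"sequence\"}},\n",
- "    opset_version=13,\n",
- ")"
- ]
- },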
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c2b2be1a-021c-4f6c-957d-2ff7d1b95976",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "from GPT2.export import GPT2TorchFile\n",
- "from GPT2.GPT2ModelConfig import GPT2Metadata"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7144d206-c690-4d4c-b590-3eb25e31d106",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=False), other=GPT2Metadata(kv_cache=False)) # kv_cache is disabled because it exports extra input/output to the model\n",
- "gpt2 = GPT2TorchFile(model.to('cpu'), metadata)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "dbaa89e4-e83d-4380-a6f8-932fcfeb64d3",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "!mkdir -p ./models/$GPT2_VARIANT/ONNX\n",
- "\n",
- "onnx_path = ('./models/{}/ONNX/{}.onnx'.format(GPT2_VARIANT, GPT2_VARIANT))\n",
- "gpt2.as_onnx_model(onnx_path, force_overwrite=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "88b04de1-e887-445c-9bc8-e2a7e0fca7ea",
- "metadata": {},
- "source": [
- "Let's take a look at the onnx file and investigate its input and output. You should see that \"input_ids\" as the input, and \"logits\" as the output."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7e4fff25-97da-4f9f-ae98-e918745faebb",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import onnx"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "03c409e6-d312-4cc7-b13f-4621609d5633",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "onnx_model = onnx.load(onnx_path)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "2314caaf-836d-4140-93e4-4b3f4c931347",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "onnx_model.graph.input"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4fe7a8d4-2bc3-49fc-863a-0e7f4be6565e",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "onnx_model.graph.output"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7baf007e-5508-485c-a87f-9bfe16260452",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 3. Convert to TensorRT engine\n",
- "\n",
- "Now we are ready to parse the ONNX model and convert it to an optimized TensorRT model.\n",
- "\n",
- "Since the model contains dynamic input shapes, we can specify a valid input range with a TensorRT optimization profile.\n",
- "\n",
- "Note: As TensorRT carries out many optimization, this conversion process for the larger model might take a while."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "037ac958-2627-439c-9db5-27640e3f7967",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from polygraphy.backend.trt import Profile\n",
- "from tensorrt import PreviewFeature\n",
- "from GPT2.export import GPT2ONNXFile, GPT2TRTEngine"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6bd6e3fc-6797-46b0-a211-ce42d3769105",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "!mkdir -p ./models/$GPT2_VARIANT/trt-engine\n",
- "trt_engine_folder = './models/{}/trt-engine'.format(GPT2_VARIANT)\n",
- "\n",
- "# Create optimization profile for dynamic shape input. Can modify batch_size / max_sequence_length to build engines for different shapes\n",
- "batch_size = 1\n",
- "disable_preview_dynamic_shapes = False # preview_dynamic_shapes optimizes the trt engine building time\n",
- "# We can either use input length as the optimal length, or use max_length // 2. \n",
- "# In T5 or BART, input_length is better, but in GPT-2, max_length // 2 is better because we need to generate max_length number of tokens\n",
- "\n",
- "use_input_length = False\n",
- "opt_length = input_id.shape[1] if use_input_length else max_length // 2 \n",
- "# Create different engine tags for different configurations\n",
- "engine_tag = f\"bs{batch_size}\"\n",
- "preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-noPreviewFasterDynamicShapes\"\n",
- "else:\n",
- " preview_features += [PreviewFeature.FASTER_DYNAMIC_SHAPES_0805]\n",
- "\n",
- "profiles = [Profile().add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, opt_length), # Optimized based on the inputs. \n",
- " max=(batch_size, max_length),\n",
- ")]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "5538106b-3ae4-4d5f-b0ee-1f76174dcecc",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "profiles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3a5934f0-46d3-45d7-8dd5-6cf81de61e66",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "engine_path = os.path.join(trt_engine_folder, f\"{GPT2_VARIANT}-{engine_tag}.engine\")\n",
- "if not os.path.exists(engine_path):\n",
- " gpt2_engine = GPT2ONNXFile(onnx_path, metadata).as_trt_engine(output_fpath=engine_path, profiles=profiles, preview_features=preview_features)\n",
- "else:\n",
- " gpt2_engine = GPT2TRTEngine(engine_path, metadata)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "74f7f6fc-1e6a-4ddc-8e9b-543d9e8dab4d",
- "metadata": {},
- "source": [
- "### Inference with TensorRT engine\n",
- "\n",
- "Great, if you have reached this stage, it means we now have an optimized TensorRT engine for the GPT-2 model, ready for us to carry out inference. \n",
- "\n",
- "The GPT-2 model with TensorRT backend can now be employed in place of the original HuggingFace GPT-2 model."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "54ae13aa-bf6f-4eb7-a453-389865562ae4",
- "metadata": {},
- "source": [
- "#### Single batch inference\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "343b58f1-3d9f-4844-85c9-73058bd36a83",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from GPT2.trt import GPT2TRTDecoder\n",
- "config = GPT2Config.from_pretrained(GPT2_VARIANT, use_cache = False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0cfda583-b684-48b1-9046-15ab022ef982",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "28fc60ad-73a7-46df-85d7-a292a8abbd80",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# Benchmarking TensorRT performance on single batch\n",
- "_, decoder_e2e_median_time = gpt2_inference(\n",
- " gpt2_trt, input_ids, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- " )\n",
- "decoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "01d86d29-1c7b-4020-9ef2-b77ea5e52764",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "with torch.no_grad():\n",
- " outputs = gpt2_trt(input_ids=input_ids)\n",
- "logits = outputs.logits"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d32e0162-c9eb-473d-ace6-c4c61ff578b5",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "logits, logits.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "22122064-5a17-4990-bd6b-073fca5a3e9b",
- "metadata": {},
- "source": [
- "#### Open-end text generation\n",
- "Let's generate the same task again. Since GPT-2 is an open-ended model, a small turbulent in the model might have a very different result. Since we have done some format changes and input/output restriction while exporting the model, you might see a different result compared to raw HuggingFace model. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "848bffb8-a7a4-4fcb-91c9-f4e9f7263e6c",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "sample_output = gpt2_trt.generate(input_ids.cuda(), max_length=max_length)\n",
- "\n",
- "# de-tokenize model output to raw text\n",
- "tokenizer.decode(sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b4c8bc4c-bf3e-4cb5-afc6-c0bd7d8655cb",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " gpt2_trt, input_ids.cuda(), tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6b68a915-2c32-49e5-b1f6-e93d7618f637",
- "metadata": {},
- "source": [
- "You can now compare the output of the original PyTorch model and the TensorRT engine. Notice the speed difference. On an NVIDIA V100 32GB GPU, this results in about ~5x performance improvement for the GPT-2 model (from an average of 0.704s to 0.134s)."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b2562388-d97b-45dd-8569-3f6c053f4e98",
- "metadata": {},
- "source": [
- "Now you have known how to convert a model to onnx, build TRT engine and optimize it. As you might have recalled, using kv cache and beam search are two important ways to improve the performance of the decoder models. We have recently added thse support to our HuggingFace demo. "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a4132e54-aba7-42ec-8324-c68d82c17296",
- "metadata": {
- "tags": []
- },
- "source": [
- "\n",
- "\n",
- "## 4. Advanced Topic: KV Cache\n",
- "\n",
- "As you have seen above, we put `use_cache = False` in some code blocks. This is because in the simplified model, we only take `input_ids` as input and `logits` as output. `input_ids` is growing as the sequence goes longer. In reality, we sometimes cache the self-attentions for each layer and reuse them in the later computations. This allows us to only take the last generated `input_ids`. This is a trade-off between space and time. When the model is small or the sequence is small, the D2D data copy time usually outweights the performance improvement of the model. However, performance improvements have been found in larger models with larger sequence length like 512. "
- ]
- },
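- {
- "cell_type": "markdown",
- "id": "c7a91f2e-3b6d-4e0a-9d2f-5a1b8c4e7f90",
- "metadata": {},
- "source": [
- "Before moving on, the next cell is a small optional sketch of what the HuggingFace kv cache looks like. It is an illustration only and assumes the `model` and `tokenizer` objects loaded earlier in this notebook are still in scope: with `use_cache=True`, the model returns `past_key_values`, one `(key, value)` pair per layer with shape `(batch, num_heads, seq_len, head_dim)`, so the next step only needs to feed the newest token."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e4b2d7a1-8c5f-4f3a-b6e9-2d7c1a0f5b34",
- "metadata": {},
- "outputs": [],
- "source": [
- "import torch\n",
- "\n",
- "# Optional illustration of the HuggingFace kv cache (assumes `model` and `tokenizer` are in scope).\n",
- "probe_ids = tokenizer(\"TensorRT is\", return_tensors=\"pt\").input_ids.to(model.device)\n",
- "with torch.no_grad():\n",
- "    out = model(input_ids=probe_ids, use_cache=True)\n",
- "\n",
- "# One (key, value) pair per layer; each tensor is (batch, num_heads, seq_len, head_dim).\n",
- "print(len(out.past_key_values), out.past_key_values[0][0].shape)\n",
- "\n",
- "# The next step only feeds the newest token together with the cached keys/values.\n",
- "next_token = out.logits[:, -1:].argmax(-1)\n",
- "with torch.no_grad():\n",
- "    out2 = model(input_ids=next_token, past_key_values=out.past_key_values, use_cache=True)\n",
- "print(out2.past_key_values[0][0].shape)  # the cached sequence length grows by 1"
- ]
- },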
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e33d1dcb-250f-4d86-9726-b114d4962fd4",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "use_cache = True\n",
- "kv_config = GPT2Config.from_pretrained(GPT2_VARIANT, use_cache = use_cache)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fd8fdf0f-2da0-46c0-a948-e4e6e16b898a",
- "metadata": {},
- "source": [
- "#### Raw HuggingFace\n",
- "\n",
- "The model that we download from `GPT2LMHeadModel.from_pretrained` is dynamic in its inputs. It can take both kv and non-kv configurations. Changing `use_cache` will do it. You can see that changing this configuration, the output is changed. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "26b3c51a-07ee-4936-b620-50766a45b945",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " model, input_ids, tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length, use_cache = use_cache\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a14607bf-f449-4151-9076-d099ae1a3ae1",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "sample_output = model.generate(input_ids, max_length=max_length, use_cache = use_cache)\n",
- "\n",
- "# de-tokenize model output to raw text\n",
- "tokenizer.decode(sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9057ef83-0cdc-4631-9958-66d04fc7fc22",
- "metadata": {},
- "source": [
- "#### TensorRT\n",
- "\n",
- "For the 1st decoding step, we take `input_ids` and generate both `logits` and the kv cache. In other steps, we take the new `input_ids` with `past` kv-cache and the outputs are `logits` and the updated `present` kv-cache. Taking dynamic number of inputs for trt is not currently supported in our demo, so we need to output 2 onnx files and build 2 engines."
- ]
- },
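- {
- "cell_type": "markdown",
- "id": "1d8f3c6a-2e7b-4a9d-8f0c-6b3e5a2d7c18",
- "metadata": {},
- "source": [
- "To make the control flow concrete, the next cell is a conceptual sketch of how the two engines cooperate during greedy decoding. It is not the demo's actual implementation (that lives in `GPT2TRTDecoder`), and `run_context` / `run_generation` are hypothetical stand-ins for the context and generation engines described above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7a5e9c2b-4d1f-4b8a-a3c6-9e0d2f7b5c41",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Conceptual sketch only: greedy decoding with a context engine and a generation engine.\n",
- "# `run_context(prompt_ids)` -> (logits, kv_cache) and `run_generation(token_id, kv_cache)` -> (logits, kv_cache)\n",
- "# are hypothetical stand-ins for the two TensorRT engines; the real logic lives in GPT2.trt.GPT2TRTDecoder.\n",
- "def greedy_decode_two_engines(prompt_ids, run_context, run_generation, max_new_tokens):\n",
- "    # Context phase: consume the whole prompt once and produce the initial kv cache.\n",
- "    logits, kv_cache = run_context(prompt_ids)\n",
- "    generated = list(prompt_ids)\n",
- "    for _ in range(max_new_tokens):\n",
- "        next_token = int(logits[-1].argmax())  # greedy pick at the last position\n",
- "        generated.append(next_token)\n",
- "        # Generation phase: feed only the newest token plus the 'past' kv cache.\n",
- "        logits, kv_cache = run_generation(next_token, kv_cache)\n",
- "    return generated"
- ]
- },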
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f1fbcfad-9c9c-47e2-894a-731c7a3a04df",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "kv_metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=False), other=GPT2Metadata(kv_cache=use_cache))\n",
- "kv_gpt2 = GPT2TorchFile(model.to('cpu'), kv_metadata)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3fe680c5-d9ff-466f-87fe-a7bb0cbee944",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "kv_onnx_path = ('./models/{}/ONNX/{}-kv_cache.onnx'.format(GPT2_VARIANT, GPT2_VARIANT))\n",
- "kv_gpt2.as_onnx_model(kv_onnx_path, force_overwrite=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4f0f6824-286d-4afa-926b-7eed4cafafc7",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "kv_onnx_model = onnx.load(kv_onnx_path)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b71f0012-7a2d-41be-a8d8-c818dcb7c244",
- "metadata": {},
- "source": [
- "We could see that the kv model has #inputs = #outputs = num_layers * 2 + 1"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7579aeec-2c7a-43de-b8f7-beff8d3d7784",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "len(kv_onnx_model.graph.input), len(kv_onnx_model.graph.output)"
- ]
- },
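- {
- "cell_type": "markdown",
- "id": "5b0d8e7f-6a2c-4c1b-9f3e-8d4a1c6b2e57",
- "metadata": {},
- "source": [
- "As an optional check (this simply reuses `kv_onnx_model` from the cell above), we can also print the tensor names to see which inputs and outputs correspond to the kv cache."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "9c3f1a8d-0b6e-4d2a-b7c5-3e8f2a1d6b90",
- "metadata": {},
- "outputs": [],
- "source": [
- "# List the I/O tensor names of the kv-cache ONNX model.\n",
- "# Besides \"input_ids\" and \"logits\", there should be one past/present key and value per layer.\n",
- "print([inp.name for inp in kv_onnx_model.graph.input])\n",
- "print([out.name for out in kv_onnx_model.graph.output])"
- ]
- },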
- {
- "cell_type": "markdown",
- "id": "9add1139-aab0-4531-b2ac-c3aca90e5d49",
- "metadata": {},
- "source": [
- "The next blocks will set up the profile and build the engine. The only difference is that we now have the profile for kv cache"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "9ae055cb-41b7-4523-86bc-490bc9edf204",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "batch_size = 1\n",
- "disable_preview_dynamic_shapes = False\n",
- "\n",
- "engine_tag = \"bs{}\".format(batch_size)\n",
- "\n",
- "preview_features = [PreviewFeature.FASTER_DYNAMIC_SHAPES_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-disableFasterDynamicShapes\"\n",
- " preview_features = []\n",
- "\n",
- "use_input_length = False\n",
- "num_heads = kv_config.n_head\n",
- "embedding_size_per_head = kv_config.n_embd // num_heads\n",
- "num_layers = kv_config.n_layer\n",
- "\n",
- "max_sequence_length = max_length\n",
- "max_output_length = max_length\n",
- "if not use_input_length:\n",
- " opt_input_seq_len = max_sequence_length // 2\n",
- "else:\n",
- " opt_input_seq_len = input_ids.shape[1]\n",
- "\n",
- "opt_output_seq_len = max_output_length // 2\n",
- "\n",
- "# context phase uses the provided input_ids to generate hidden states and self attention kv cache\n",
- "# It is only used in the 1st decoder run.\n",
- "dec_profiles_context = Profile().add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, opt_output_seq_len),\n",
- " max=(batch_size, max_output_length),\n",
- ")\n",
- "self_attention_profile_context = {\n",
- " \"min\": (batch_size, num_heads, 0, embedding_size_per_head),\n",
- " \"opt\": (batch_size, num_heads, 0, embedding_size_per_head),\n",
- " \"max\": (batch_size, num_heads, 0, embedding_size_per_head),\n",
- "}\n",
- "\n",
- "# generation phase uses previous self attention kv cache with the last input_ids token to generate the next hidden states and self attention kv cache\n",
- "# This optimization profile is used after the 1st decoder run.\n",
- "dec_profiles_generation = Profile().add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, 1),\n",
- " max=(batch_size, 1),\n",
- ")\n",
- "\n",
- "self_attention_profile_generation = {\n",
- " \"min\": (batch_size, num_heads, 1, embedding_size_per_head),\n",
- " \"opt\": (batch_size, num_heads, opt_output_seq_len - 1, embedding_size_per_head),\n",
- " \"max\": (batch_size, num_heads, max_output_length - 1, embedding_size_per_head),\n",
- "}\n",
- "\n",
- "for i in range(num_layers):\n",
- " dec_profiles_context = dec_profiles_context.add(\n",
- " f\"past_key_values.{i}.decoder.key\",\n",
- " **self_attention_profile_context\n",
- " ).add(\n",
- " f\"past_key_values.{i}.decoder.value\",\n",
- " **self_attention_profile_context\n",
- " )\n",
- "\n",
- " dec_profiles_generation = dec_profiles_generation.add(\n",
- " f\"past_key_values.{i}.decoder.key\",\n",
- " **self_attention_profile_generation\n",
- " ).add(\n",
- " f\"past_key_values.{i}.decoder.value\",\n",
- " **self_attention_profile_generation\n",
- " )\n",
- "\n",
- "# TensorRT accepts multiple optimization engines for the same model.\n",
- "# Profile 1 is only used in the first decoder iterations.\n",
- "decoder_profiles = [dec_profiles_generation, dec_profiles_context]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4eadf843-9f60-41c7-90a9-098b33ce3603",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "kv_engine_path = os.path.join(trt_engine_folder, f\"{GPT2_VARIANT}-kv_cache_{engine_tag}.engine\")\n",
- "\n",
- "# Set up the trt engine with both kv input/output augmented\n",
- "if not os.path.exists(kv_engine_path):\n",
- " kv_gpt2_engine = GPT2ONNXFile(kv_onnx_path, kv_metadata).as_trt_engine(kv_engine_path,profiles=decoder_profiles, preview_features=preview_features)\n",
- "else:\n",
- " kv_gpt2_engine = GPT2TRTEngine(kv_engine_path, kv_metadata)\n",
- "\n",
- " \n",
- "kv_gpt2_trt = GPT2TRTDecoder(\n",
- " kv_gpt2_engine, kv_metadata, kv_config, batch_size=batch_size\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "090007db-9a09-4b6d-95ed-8a688ea05798",
- "metadata": {},
- "source": [
- "Since we have 2 profiles, benchmarking single-run runtime does not make sense. We instead use `full_inference` to measure the time for the entire inference cycle."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a3b93d88-21bb-4f87-9ff6-709d0babdf34",
- "metadata": {},
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " kv_gpt2_trt, input_ids.cuda(), tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length, use_cache = use_cache\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d89ab217-9ee4-435c-b689-69d98cef1cc4",
- "metadata": {},
- "outputs": [],
- "source": [
- "kv_gpt2_trt.reset()\n",
- "kv_sample_output = kv_gpt2_trt.generate(input_ids.cuda(), max_length=max_length)\n",
- "tokenizer.decode(kv_sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2b614fb8-63d6-4711-84cf-c69ca8b3f141",
- "metadata": {},
- "source": [
- "In this short example, kv cache performance does not improve the performance, and may even be slightly worse than non kv cache mode. However, when we have larger input sequences for the model, it will be better."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f764049f-0578-4305-b010-4e7a3156a377",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 5. Advanced Topic: Beam Search\n",
- "\n",
- "Beam search is a way to increase the model quality. It looks for the top `num_beams` number of possible words and pick the one that conditions the best to the current position. Similarly, the original HuggingFace PyTorch model supports beam search natively, while we need to build separate trt engine for different `num_beams`."
- ]
- },
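- {
- "cell_type": "markdown",
- "id": "2f7b4d9a-1c8e-4a3f-b0d6-5e2c9a7f4b13",
- "metadata": {},
- "source": [
- "To make the idea concrete, here is a tiny self-contained toy example of beam search over a made-up 3-word vocabulary with a hand-written scoring table. It is unrelated to GPT-2 itself and only illustrates how the top `num_beams` partial sequences are kept at every step."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8e1c6b3f-7d4a-4f9b-a2e8-0c5d3b9f1a26",
- "metadata": {},
- "outputs": [],
- "source": [
- "import math\n",
- "\n",
- "# Toy next-token log-probabilities over a 3-word vocabulary (context-independent for simplicity).\n",
- "toy_logprobs = {\"cat\": math.log(0.5), \"sat\": math.log(0.3), \"mat\": math.log(0.2)}\n",
- "\n",
- "def toy_beam_search(num_beams=2, steps=3):\n",
- "    beams = [([], 0.0)]  # each beam is (sequence, cumulative log-probability)\n",
- "    for _ in range(steps):\n",
- "        candidates = []\n",
- "        for seq, score in beams:\n",
- "            for word, lp in toy_logprobs.items():\n",
- "                candidates.append((seq + [word], score + lp))\n",
- "        # Keep only the num_beams best-scoring partial sequences.\n",
- "        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]\n",
- "    return beams\n",
- "\n",
- "toy_beam_search(num_beams=2)"
- ]
- },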
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "5a5808db-2cc0-4d88-aebe-1b6e17a023e7",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_config = GPT2Config.from_pretrained(GPT2_VARIANT, use_cache = False)\n",
- "beam_metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=False), other=GPT2Metadata(kv_cache=False))\n",
- "num_beams = 3"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d1403609-b24d-4e10-a8eb-852d3eab6fa0",
- "metadata": {},
- "source": [
- "#### HuggingFace"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "cfd992c8-1eeb-427c-ae32-2c63766c6a69",
- "metadata": {},
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " model, input_ids, tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length, num_beams = num_beams\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "09418760-84bd-4308-b06b-8540945a6dcf",
- "metadata": {},
- "outputs": [],
- "source": [
- "sample_output = model.generate(input_ids, max_length=max_length, num_beams = num_beams)\n",
- "\n",
- "# de-tokenize model output to raw text\n",
- "tokenizer.decode(sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "71b8d9fa-d74a-40dd-94ce-d98551d24608",
- "metadata": {},
- "source": [
- "You could see that the output is very different from the original one. If you change `num_beams`, the result will also change significantly."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ba01e0ec-68ad-4682-8ca4-2ecde7d70f7f",
- "metadata": {},
- "source": [
- "#### TensorRT\n",
- "It uses the same onnx file as the original configuration, but the engine set up is differently, because it expands the inputs by `num_beams` for the first dimension of inputs."
- ]
- },
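- {
- "cell_type": "markdown",
- "id": "6d2a9f4c-3e7b-4b1d-8a0f-9c5e2b7d4a38",
- "metadata": {},
- "source": [
- "The small snippet below is only an illustration of why the optimization profile in the next cell uses `batch_size * num_beams` for the first dimension: beam search duplicates each input sequence `num_beams` times along the batch axis."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0f8b5c2d-9a4e-4c7f-b1d3-6e2a8c5f9b07",
- "metadata": {},
- "outputs": [],
- "source": [
- "import torch\n",
- "\n",
- "# Illustration only: beam search expands the batch dimension by num_beams.\n",
- "example = torch.tensor([[101, 102, 103]])               # shape (1, 3): one sequence\n",
- "expanded = example.repeat_interleave(num_beams, dim=0)  # shape (num_beams, 3)\n",
- "example.shape, expanded.shape"
- ]
- },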
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "055fb314-8e0f-4edd-bf78-16890d196de4",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Create optimization profile for dynamic shape input. Can modify batch_size / max_sequence_length to build engines for different shapes\n",
- "batch_size = 1\n",
- "disable_preview_dynamic_shapes = False # preview_dynamic_shapes optimizes the trt engine building time\n",
- "# We can either use input length as the optimal length, or use max_length // 2. \n",
- "# In T5 or BART, input_length is better, but in GPT-2, max_length // 2 is better because we need to generate max_length number of tokens\n",
- "\n",
- "use_input_length = False\n",
- "opt_length = input_id.shape[1] if use_input_length else max_length // 2 \n",
- "# Create different engine tags for different configurations\n",
- "engine_tag = f\"bs{batch_size}-beam{num_beams}\"\n",
- "\n",
- "preview_features = [PreviewFeature.FASTER_DYNAMIC_SHAPES_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-disableFasterDynamicShapes\"\n",
- " preview_features = []\n",
- " \n",
- "\n",
- "beam_profiles = [Profile().add(\n",
- " \"input_ids\",\n",
- " min=(batch_size * num_beams, 1),\n",
- " opt=(batch_size * num_beams, opt_length), # Optimized based on the inputs. \n",
- " max=(batch_size * num_beams, max_length),\n",
- ")]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "18986d0f-9509-463f-a489-a76dd4d28a88",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_profiles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "cfd04a7b-8aa6-4c97-8d85-96f14b06abbc",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_engine_path = os.path.join(trt_engine_folder, f\"{GPT2_VARIANT}-{engine_tag}.engine\")\n",
- "if not os.path.exists(beam_engine_path):\n",
- " beam_gpt2_engine = GPT2ONNXFile(onnx_path, beam_metadata).as_trt_engine(output_fpath=beam_engine_path, profiles=beam_profiles, preview_features=preview_features)\n",
- "else:\n",
- " beam_gpt2_engine = GPT2TRTEngine(beam_engine_path, beam_metadata)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "18fe1dba-4e84-478e-9ea7-07c21856e6bd",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_gpt2_trt = GPT2TRTDecoder(beam_gpt2_engine, beam_metadata, beam_config, num_beams = num_beams)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "42614e14-c962-4c31-a469-7e0343efbdbb",
- "metadata": {},
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " beam_gpt2_trt, input_ids.cuda(), tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length, num_beams=num_beams\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "391ab05a-fe0d-42c3-9591-605ddab389ce",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_sample_output = beam_gpt2_trt.generate(input_ids.cuda(), max_length=max_length, num_beams=num_beams)\n",
- "tokenizer.decode(beam_sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9543dbfd-4650-46f5-8f77-587dcb05785a",
- "metadata": {},
- "source": [
- "We could see that because of larger batch size, beam search will take slightly longer, but for most sequences, it will generate more meaningful outputs."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "cbfc6c04-ca47-4fc6-9a12-ed500722bb4a",
- "metadata": {},
- "source": [
- "## Conclusion and where-to next?\n",
- "\n",
- "This notebook has walked you through the process of converting a HuggingFace PyTorch GPT-2 model to an optimized TensorRT engine for inference in 3 easy steps. The TensorRT inference engine can be conviniently used as a drop-in replacement for the orginial HuggingFace GPT-2 model while providing significant speed up. \n",
- "\n",
- "If you are interested in further details of the conversion process, check out [GPT2/trt.py](../GPT2/trt.py)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "14079b8f-738e-4137-9ca3-6a4254e8f006",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "celltoolbar": "Edit Metadata",
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- },
- "vscode": {
- "interpreter": {
- "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/t5-playground.ipynb b/demo/HuggingFace/notebooks/t5-playground.ipynb
deleted file mode 100644
index d17a761c..00000000
--- a/demo/HuggingFace/notebooks/t5-playground.ipynb
+++ /dev/null
@@ -1,272 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "64974d33-d028-440c-86fa-1a0633b3d31d",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c3f0ff46-9958-4d57-9067-a64be34e75da",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# T5 Playground\n",
- "\n",
- "This notebook demonstrates T5 model on the task of translation and text summarization.\n",
- "\n",
- "The TensorRT HuggingFace T5 model is a plug-in replacement for the original PyTorch HuggingFace T5 model.\n",
- "\n",
- "\n",
- "\n",
- "**Notes**: \n",
- " - For \"CPU - PyTorch\" and \"GPU - PyTorch\", a T5 small model from HuggingFace model repository is employed. Inference is carried out with PyTorch in FP32 precision. All models run with batch size 1.\n",
- "Average run time across 5 runs is reported.\n",
- " - Prior to running this notebook, run [t5.ipynb](t5.ipynb) to download the T5 model and generate the TensorRT engine."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3530e767-7050-4329-a4bc-e2221b9eb578",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import torch \n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " T5ForConditionalGeneration,\n",
- " T5Tokenizer,\n",
- " T5Config,\n",
- ")\n",
- "from transformers.modeling_outputs import BaseModelOutput\n",
- "\n",
- "# download HuggingFace model and tokernizer\n",
- "T5_VARIANT = 't5-small'\n",
- "\n",
- "t5_model = T5ForConditionalGeneration.from_pretrained(T5_VARIANT)\n",
- "tokenizer = T5Tokenizer.from_pretrained(T5_VARIANT)\n",
- "config = T5Config.from_pretrained(T5_VARIANT, use_cache = False)\n",
- "\n",
- "# load TensorRT engine\n",
- "from T5.trt import T5TRTEncoder, T5TRTDecoder, TRTHFRunner\n",
- "from T5.T5ModelConfig import T5ModelTRTConfig, T5Metadata\n",
- "from T5.export import T5DecoderTRTEngine, T5EncoderTRTEngine\n",
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "\n",
- "from transformers.generation_stopping_criteria import (\n",
- " MaxLengthCriteria,\n",
- " StoppingCriteriaList,\n",
- ")\n",
- "\n",
- "metadata=NetworkMetadata(variant=T5_VARIANT, precision=Precision(fp16=True), other=T5Metadata(kv_cache=False))\n",
- "\n",
- "from os.path import exists\n",
- "encoder_path = './models/{}/tensorrt/{}-encoder.onnx-bs1-previewFasterDynamicShapes.engine'.format(T5_VARIANT,T5_VARIANT)\n",
- "if not exists(encoder_path):\n",
- " print(\"Error: TensorRT engine not found at {}. Please run t5.ipynb to generate the TensorRT engine first!\".format(encoder_path))\n",
- "else:\n",
- " encoder_engine = T5EncoderTRTEngine('./models/{}/tensorrt/{}-encoder.onnx-bs1-previewFasterDynamicShapes.engine'.format(T5_VARIANT,T5_VARIANT), metadata)\n",
- " decoder_engine = T5DecoderTRTEngine('./models/{}/tensorrt/{}-decoder-with-lm-head.onnx-bs1-previewFasterDynamicShapes.engine'.format(T5_VARIANT,T5_VARIANT), metadata)\n",
- "\n",
- "t5_trt_encoder = T5TRTEncoder(encoder_engine, metadata, config)\n",
- "t5_trt_decoder = T5TRTDecoder(decoder_engine, metadata, config)\n",
- "\n",
- "decoder_input_ids = torch.full(\n",
- " (1, 1), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32\n",
- ").to(\"cuda:0\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "766b8c94-ba8e-47c8-8624-57da462a0496",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import ipywidgets as widgets\n",
- "import numpy as np\n",
- "import time\n",
- "\n",
- "device = widgets.RadioButtons(\n",
- " options=['CPU - PyTorch', \n",
- " 'GPU - PyTorch', \n",
- " 'GPU - TensorRT'],\n",
- " description='Device:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "task = widgets.RadioButtons(\n",
- " options=['En -> German', \n",
- " 'Summarize', \n",
- " ],\n",
- " description='Task:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "paragraph_text = widgets.Textarea(\n",
- " value='TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps'\\\n",
- " 'such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops'\\\n",
- " 'and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep'\\\n",
- " 'learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.',\n",
- " placeholder='Type something',\n",
- " description='Context:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5, \n",
- ")\n",
- "\n",
- "\n",
- "generated_text = widgets.Textarea(\n",
- " value='...',\n",
- " placeholder='Context',\n",
- " description='T5 output:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5,\n",
- ")\n",
- "button = widgets.Button(description=\"Generate\")\n",
- "\n",
- "display(paragraph_text)\n",
- "display(generated_text)\n",
- "display(device)\n",
- "display(task)\n",
- "\n",
- "from IPython.display import display\n",
- "box_layout = widgets.Layout(display='flex',\n",
- " flex_flow='column',\n",
- " align_items='center',\n",
- " width='100%')\n",
- "N_RUN = 6\n",
- "progress_bar = widgets.IntProgress(\n",
- " value=0,\n",
- " min=0,\n",
- " max=N_RUN,\n",
- " description='Progress:',\n",
- " bar_style='', # 'success', 'info', 'warning', 'danger' or ''\n",
- " style={'bar_color': 'green'},\n",
- " orientation='horizontal', \n",
- " layout=widgets.Layout(width='100%', height='50px')\n",
- ")\n",
- "\n",
- "box = widgets.HBox(children=[button],layout=box_layout)\n",
- "output = widgets.Output()\n",
- "display(box)\n",
- "display(progress_bar)\n",
- "display(output)\n",
- "\n",
- "MAX_LENGTH = 256\n",
- "\n",
- "def generate(b):\n",
- " progress_bar.value = 0\n",
- " inference_time_arr = []\n",
- " prefix = 'translate English to German' if task.value=='En -> German' else 'summarize'\n",
- " inputs = tokenizer(\"{}: {}\".format(prefix, paragraph_text.value), return_tensors=\"pt\")\n",
- " with output:\n",
- " if device.value == 'GPU - TensorRT':\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " encoder_last_hidden_state = t5_trt_encoder(input_ids=inputs.input_ids.to('cuda:0'))\n",
- " outputs = t5_trt_decoder.generate(\n",
- " inputs.input_ids.to('cuda:0'),\n",
- " max_length = MAX_LENGTH,\n",
- " min_length = 1,\n",
- " eos_token_id = t5_trt_decoder.config.eos_token_id,\n",
- " pad_token_id = t5_trt_decoder.config.pad_token_id,\n",
- " encoder_outputs = BaseModelOutput(last_hidden_state = encoder_last_hidden_state.to('cuda:0')),\n",
- " )\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"GPU - TensorRT - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- " elif device.value == 'CPU - PyTorch':\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " outputs = t5_model.to('cpu').generate(inputs.input_ids.to('cpu'), max_length=MAX_LENGTH)\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"CPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:])))\n",
- " \n",
- " elif device.value == 'GPU - PyTorch': \n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " outputs = t5_model.to('cuda:0').generate(inputs.input_ids.to('cuda:0'), max_length=MAX_LENGTH)\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"GPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- "button.on_click(generate)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "58f473c0-6682-41af-8040-72f0a9472b0f",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/t5.ipynb b/demo/HuggingFace/notebooks/t5.ipynb
deleted file mode 100644
index c708e04e..00000000
--- a/demo/HuggingFace/notebooks/t5.ipynb
+++ /dev/null
@@ -1,664 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "28e6e614-e360-4292-965e-0d255027e9b9",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9b88dc1a-a92d-44cc-9fb7-d9e2ef20c8e2",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# Accelerating HuggingFace T5 Inference with TensorRT\n",
- "\n",
- "T5 is an encoder-decoder model that converts all NLP problems into a text-to-text format. More specifically, it does so by encoding different tasks as text directives in the input stream. This enables a single model to be trained supervised on a wide variety of NLP tasks such as translation, classification, Q&A and summarization.\n",
- "\n",
- "This notebook shows 3 easy steps to convert a [HuggingFace PyTorch T5 model](https://huggingface.co/transformers/model_doc/t5.html) to a TensorRT engine for high-performance inference.\n",
- "\n",
- "1. [Download HuggingFace T5 model](#1)\n",
- "1. [Convert to ONNX format](#2)\n",
- "1. [Convert to TensorRT engine](#3)\n",
- "\n",
- "## Prerequisite\n",
- "\n",
- "Follow the instruction at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS docker container required to run this notebook.\n",
- "\n",
- "Next, we install some extra dependencies."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0c36ecb7-c622-4d95-a851-b9a6eb18e81b",
- "metadata": {},
- "outputs": [],
- "source": [
- "%%capture\n",
- "!pip3 install -r ../requirements.txt"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a1bbdafb",
- "metadata": {},
- "source": [
- "**Note:** After this step, you should restart the Jupyter kernel for the change to take effect."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "235d2f1b-439e-4cd0-8286-1d63a13f2cf3",
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import torch\n",
- "import tensorrt as trt\n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " T5ForConditionalGeneration,\n",
- " T5Tokenizer,\n",
- " T5Config,\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "af4254e2-11fd-4bc7-ac0b-60b1a9e07c4e",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 1. Download HuggingFace T5 model\n",
- "\n",
- "First, we download the original HuggingFace PyTorch T5 model from HuggingFace model hubs, together with its associated tokernizer.\n",
- "\n",
- "The T5 variants that are suported by TensorRT 8 are: t5-small (60M), t5-base (220M), t5-large (770M), t5-3b(3B), t5-11b(11B)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fae66d58-f994-4987-8f1d-1fa8ac2ec8b4",
- "metadata": {},
- "outputs": [],
- "source": [
- "T5_VARIANT = 't5-small' # choices: t5-small | t5-base | t5-large | t5-3b | t5-11b\n",
- "\n",
- "t5_model = T5ForConditionalGeneration.from_pretrained(T5_VARIANT)\n",
- "tokenizer = T5Tokenizer.from_pretrained(T5_VARIANT)\n",
- "config = T5Config.from_pretrained(T5_VARIANT, use_cache = False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7252ca90-1104-40dc-8e72-f51c07a4cd11",
- "metadata": {},
- "outputs": [],
- "source": [
- "# save model locally\n",
- "pytorch_model_dir = './models/{}/pytorch'.format(T5_VARIANT)\n",
- "!mkdir -p $pytorch_model_dir\n",
- "\n",
- "t5_model.save_pretrained(pytorch_model_dir)\n",
- "print(\"Pytorch Model saved to {}\".format(pytorch_model_dir))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "11ea023d-c4d4-43bb-9d77-c76684e0b06f",
- "metadata": {},
- "source": [
- "### Inference with PyTorch model\n",
- "\n",
- "Next, we will carry out inference with the PyTorch model.\n",
- "\n",
- "#### Single example inference"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "544dea73",
- "metadata": {},
- "outputs": [],
- "source": [
- "inputs = tokenizer(\"translate English to German: That is good.\", return_tensors=\"pt\")\n",
- "input_ids = inputs.input_ids\n",
- "num_beams = 1"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ed1edf8a",
- "metadata": {},
- "outputs": [],
- "source": [
- "# WAR: Using an ugly representation because cuda 11.4 does not support GPU models due to cublas errors\n",
- "if \"cuda-11.4\" in os.environ[\"LD_LIBRARY_PATH\"]:\n",
- " t5_model = t5_model.cpu()\n",
- " input_ids = input_ids.cpu()\n",
- " inputs = inputs.to('cpu')\n",
- "else:\n",
- " t5_model = t5_model.cuda()\n",
- " input_ids = input_ids.cuda()\n",
- " inputs = inputs.to('cuda:0')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "13913fd9",
- "metadata": {},
- "outputs": [],
- "source": [
- "# inference on a single example\n",
- "t5_model.eval()\n",
- "with torch.no_grad():\n",
- " outputs = t5_model(**inputs, labels=inputs[\"input_ids\"])\n",
- "\n",
- "logits = outputs.logits"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "98f7fd8b-2ee3-4d25-9204-7713eb7e90b3",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Generate sequence for an input\n",
- "outputs = t5_model.generate(input_ids, num_beams=num_beams)\n",
- "print(tokenizer.decode(outputs[0], skip_special_tokens=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "667fcacc-02cb-415d-a9ff-2d2ec44ef225",
- "metadata": {},
- "source": [
- "#### Model inference benchmark: encoder and decoder stacks\n",
- "\n",
- "For benchmarking purposes, we will employ a helper functions `encoder_inference` and `decoder_inference` which execute the inference repeatedly for the T5 encoder and decoder stacks separately, and measure end to end execution time. Let's take note of this execution time for comparison with TensorRT. \n",
- " \n",
- "`TimingProfile` is a named tuple that specifies the number of experiments and number of times to call the function per iteration (and number of warm-up calls although it is not used here)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "596ea542-d9e5-4367-b643-d60027fa05e6",
- "metadata": {},
- "outputs": [],
- "source": [
- "from T5.measurements import decoder_inference, encoder_inference, full_inference\n",
- "from T5.export import T5EncoderTorchFile, T5DecoderTorchFile, T5EncoderTRTEngine, T5DecoderTRTEngine\n",
- "from NNDF.networks import TimingProfile\n",
- "from NNDF.torch_utils import expand_inputs_for_beam_search\n",
- "\n",
- "t5_torch_encoder = T5EncoderTorchFile.TorchModule(t5_model.encoder)\n",
- "t5_torch_decoder = T5DecoderTorchFile.TorchModule(\n",
- " t5_model.decoder, t5_model.lm_head, t5_model.config\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "be755fbc-c53e-4f8d-a9c2-4817167cf93a",
- "metadata": {},
- "outputs": [],
- "source": [
- "input_ids = inputs.input_ids\n",
- "\n",
- "encoder_last_hidden_state, encoder_e2e_median_time = encoder_inference(\n",
- " t5_torch_encoder, input_ids, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- ")\n",
- "encoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "960f05fc-f572-4832-ad82-8a75823866b1",
- "metadata": {},
- "outputs": [],
- "source": [
- "_, decoder_e2e_median_time = decoder_inference(\n",
- " t5_torch_decoder, input_ids, encoder_last_hidden_state, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- ")\n",
- "decoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a99d5a06-a8f5-4ce7-a34c-bc42f07ac706",
- "metadata": {},
- "source": [
- "#### Full model inference and benchmark\n",
- "\n",
- "Next, we will try the T5 model for the task of translation from English to German.\n",
- "\n",
- "For benchmarking purposes, we will employ a helper function `full_inference` which executes the inference repeatedly and measures end to end execution time. Let's take note of this execution time for comparison with TensorRT. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "39d511cf-d963-4629-be54-22e9a258716d",
- "metadata": {},
- "outputs": [],
- "source": [
- "from T5.T5ModelConfig import T5ModelTRTConfig, T5Metadata\n",
- "decoder_output, full_e2e_median_runtime = full_inference(\n",
- " t5_torch_encoder,\n",
- " t5_torch_decoder,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " num_beams=num_beams,\n",
- " max_length=T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[T5_VARIANT],\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8cff48fc-b792-4852-b638-6e2c54099cb2",
- "metadata": {},
- "source": [
- "Let us decode the model's output back into text."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "839bc6bc-65dc-499d-ac26-81456dbc1748",
- "metadata": {},
- "outputs": [],
- "source": [
- "# De-tokenize output to raw text\n",
- "print(tokenizer.decode(decoder_output[0], skip_special_tokens=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d662701-e430-4fdc-ad46-1f296defcf8f",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 2. Convert to ONNX\n",
- "\n",
- "Prior to converting the model to a TensorRT engine, we will first convert the PyTorch model to an intermediate universal format.\n",
- "\n",
- "ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format.\n",
- "\n",
- "The steps to convert a PyTorch model to TensorRT are as follows:\n",
- "- Convert the pretrained image segmentation PyTorch model into ONNX.\n",
- "- Import the ONNX model into TensorRT.\n",
- "- Apply optimizations and generate an engine.\n",
- "- Perform inference on the GPU. \n",
- "\n",
- "For the T5 model, we will convert the encoder and decoder seperately."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c2b2be1a-021c-4f6c-957d-2ff7d1b95976",
- "metadata": {},
- "outputs": [],
- "source": [
- "# helpers\n",
- "from NNDF.networks import NetworkMetadata, Precision"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c50346f7-6c2c-4e4b-ba70-875688947b75",
- "metadata": {},
- "outputs": [],
- "source": [
- "onnx_model_path = './models/{}/ONNX'.format(T5_VARIANT)\n",
- "\n",
- "metadata=NetworkMetadata(variant=T5_VARIANT, precision=Precision(fp16=True), other=T5Metadata(kv_cache=False))\n",
- "\n",
- "encoder_onnx_model_path = os.path.join(onnx_model_path, \"encoder\")\n",
- "decoder_onnx_model_path = os.path.join(onnx_model_path, \"decoder\")\n",
- "!mkdir -p $encoder_onnx_model_path\n",
- "!mkdir -p $decoder_onnx_model_path\n",
- "\n",
- "encoder_onnx_model_fpath = T5_VARIANT + \"-encoder.onnx\"\n",
- "decoder_onnx_model_fpath = T5_VARIANT + \"-decoder-with-lm-head.onnx\"\n",
- "\n",
- "t5_encoder = T5EncoderTorchFile(t5_model.to('cpu'), metadata)\n",
- "t5_decoder = T5DecoderTorchFile(t5_model.to('cpu'), metadata)\n",
- "\n",
- "onnx_t5_encoder = t5_encoder.as_onnx_model(\n",
- " os.path.join(encoder_onnx_model_path, encoder_onnx_model_fpath), force_overwrite=False\n",
- ")\n",
- "onnx_t5_decoder = t5_decoder.as_onnx_model(\n",
- " os.path.join(decoder_onnx_model_path, decoder_onnx_model_fpath), force_overwrite=False\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7baf007e-5508-485c-a87f-9bfe16260452",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 3. Convert to TensorRT\n",
- "\n",
- "Now we are ready to parse the ONNX encoder and decoder models and convert them to optimized TensorRT engines.\n",
- "\n",
- "Since the models contains dynamic input shapes, we can specify a valid input range with a TensorRT optimization profile."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "037ac958-2627-439c-9db5-27640e3f7967",
- "metadata": {},
- "outputs": [],
- "source": [
- "from T5.export import T5DecoderONNXFile, T5EncoderONNXFile\n",
- "from polygraphy.backend.trt import Profile\n",
- "from tensorrt import PreviewFeature"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6bd6e3fc-6797-46b0-a211-ce42d3769105",
- "metadata": {},
- "outputs": [],
- "source": [
- "tensorrt_model_path = './models/{}/tensorrt'.format(T5_VARIANT)\n",
- "!mkdir -p tensorrt_model_path\n",
- "# Decoder optimization profiles\n",
- "batch_size = 1\n",
- "max_sequence_length = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[T5_VARIANT]\n",
- "decoder_profile = Profile()\n",
- "decoder_profile.add(\n",
- " \"input_ids\",\n",
- " min=(batch_size * num_beams, 1),\n",
- " opt=(batch_size * num_beams, max_sequence_length // 2),\n",
- " max=(batch_size * num_beams, max_sequence_length),\n",
- ")\n",
- "decoder_profile.add(\n",
- " \"encoder_hidden_states\",\n",
- " min=(batch_size * num_beams, 1, max_sequence_length),\n",
- " opt=(batch_size * num_beams, max_sequence_length // 2, max_sequence_length),\n",
- " max=(batch_size * num_beams, max_sequence_length, max_sequence_length),\n",
- ")\n",
- "\n",
- "# Encoder optimization profiles\n",
- "encoder_profile = Profile()\n",
- "encoder_profile.add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, max_sequence_length // 2),\n",
- " max=(batch_size, max_sequence_length),\n",
- ")\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "cfb64120-9012-40c8-b1e2-4a6366b71294",
- "metadata": {},
- "outputs": [],
- "source": [
- "disable_preview_dynamic_shapes = False\n",
- "engine_tag = f\"bs{batch_size}\"\n",
- "\n",
- "if num_beams > 1:\n",
- " engine_tag += \"-beam{}\".format(num_beams)\n",
- "\n",
- "preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-noFasterDynamicShapes\"\n",
- "else:\n",
- " preview_features += [PreviewFeature.FASTER_DYNAMIC_SHAPES_0805]\n",
- "\n",
- "encoder_engine_name = os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + f\"-{engine_tag}.engine\".replace(f\"-beam{num_beams}\", \"\") # encoder engine not affected by beam search\n",
- "decoder_engine_name = os.path.join(tensorrt_model_path, decoder_onnx_model_fpath) + f\"-{engine_tag}.engine\"\n",
- "\n",
- "if not os.path.exists(encoder_engine_name):\n",
- " t5_trt_encoder_engine = T5EncoderONNXFile(os.path.join(encoder_onnx_model_path, encoder_onnx_model_fpath), metadata).as_trt_engine(\n",
- " encoder_engine_name,\n",
- " profiles=[encoder_profile],\n",
- " preview_features=preview_features)\n",
- "else:\n",
- " t5_trt_encoder_engine = T5EncoderTRTEngine(encoder_engine_name, metadata)\n",
- "\n",
- "if not os.path.exists(decoder_engine_name):\n",
- " t5_trt_decoder_engine = T5DecoderONNXFile(os.path.join(decoder_onnx_model_path, decoder_onnx_model_fpath), metadata).as_trt_engine(\n",
- " decoder_engine_name,\n",
- " profiles=[decoder_profile],\n",
- " preview_features=preview_features)\n",
- "else:\n",
- " t5_trt_decoder_engine = T5DecoderTRTEngine(decoder_engine_name, metadata)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "74f7f6fc-1e6a-4ddc-8e9b-543d9e8dab4d",
- "metadata": {
- "tags": []
- },
- "source": [
- "### Inference with TensorRT engine\n",
- "\n",
- "Great, if you have reached this stage, it means we now have an optimized TensorRT engine for the T5 model, ready for us to carry out inference. \n",
- "\n",
- "#### Single example inference\n",
- "The T5 model with TensorRT backend can now be employed in place of the original HuggingFace T5 model.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3954f2f4-c393-463b-a44b-3e5335032b57",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Initialize TensorRT engines\n",
- "from T5.trt import T5TRTEncoder, T5TRTDecoder\n",
- "\n",
- "t5_trt_encoder = T5TRTEncoder(\n",
- " t5_trt_encoder_engine, metadata, config\n",
- " )\n",
- "t5_trt_decoder = T5TRTDecoder(\n",
- " t5_trt_decoder_engine, metadata, config, num_beams=num_beams\n",
- " )"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a9544ecb-2671-4b53-a544-08f13424cefe",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Inference on a single sample\n",
- "encoder_last_hidden_state = t5_trt_encoder(input_ids=input_ids)\n",
- "outputs = t5_trt_decoder(\n",
- " expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, \n",
- " expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8d71a327-546f-4b5b-bd42-caaffcceafc7",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Generate sequence for an input\n",
- "max_length = 64\n",
- "\n",
- "decoder_input_ids = torch.full(\n",
- " (1, 1), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32\n",
- ").to(\"cuda:0\")\n",
- "\n",
- "encoder_last_hidden_state = t5_trt_encoder(input_ids=input_ids)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ed9d4a98-b034-470e-a9f8-096d4100b8d4",
- "metadata": {},
- "source": [
- "#### TRT engine inference benchmark: encoder and decoder stacks\n",
- "First, we will bechmark the encoder and decoder stacks as before."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "70b37591-4398-40ff-8a39-5f75347192dc",
- "metadata": {},
- "outputs": [],
- "source": [
- "encoder_last_hidden_state, encoder_e2e_median_time = encoder_inference(\n",
- " t5_trt_encoder, input_ids, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- ")\n",
- "encoder_e2e_median_time\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7e5459da-a01b-4894-88dc-01b3637ded53",
- "metadata": {},
- "outputs": [],
- "source": [
- "_, decoder_e2e_median_time = decoder_inference(\n",
- " t5_trt_decoder, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, \n",
- " expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- ")\n",
- "decoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "62ebfe03-7a60-4dd0-ad32-4e53d6012b07",
- "metadata": {},
- "source": [
- "### Full model inference benchmark\n",
- "\n",
- "Next, we will try the full TensorRT T5 engine for the task of translation. As before, note the time difference."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f31cb550-24b9-48cd-a4ec-0bf18ac5e40c",
- "metadata": {},
- "outputs": [],
- "source": [
- "decoder_output, full_e2e_median_runtime = full_inference(\n",
- " t5_trt_encoder,\n",
- " t5_trt_decoder,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],\n",
- " num_beams=num_beams,\n",
- " use_cuda=True,\n",
- ")\n",
- "\n",
- "print(tokenizer.decode(decoder_output[0], skip_special_tokens=True))\n",
- "full_e2e_median_runtime\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "92031643-8ee8-4d50-864b-a08e4d551dc6",
- "metadata": {},
- "source": [
- "You can now compare the output of the original PyTorch model and the TensorRT engine. Notice the speed difference. On an NVIDIA V100 32GB GPU, this results in upto ~10x performance improvement (from 0.0802s to 0.0082s for the T5-small variant)."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2a1f5dca-397c-4c8c-9200-61b30cdba824",
- "metadata": {},
- "source": [
- "## Conclusion and where-to next?\n",
- "\n",
- "This notebook has walked you through the process of converting a HuggingFace PyTorch T5 model to an optimized TensorRT engine for inference in 3 easy steps. The TensorRT inference engine can be conviniently used as a drop-in replacement for the orginial HuggingFace T5 model while providing significant speed up. \n",
- "\n",
- "If you are interested in further details of the conversion process, check out [T5/trt.py](../T5/trt.py)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b6a8b7c8",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.6"
- },
- "vscode": {
- "interpreter": {
- "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/requirements.txt b/demo/HuggingFace/requirements.txt
deleted file mode 100644
index 30d9cdb1..00000000
--- a/demo/HuggingFace/requirements.txt
+++ /dev/null
@@ -1,31 +0,0 @@
-#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-huggingface-hub==0.11.0; python_version>="3.7"
-huggingface-hub==0.4.0; python_version<"3.7"
-transformers==4.20.0; python_version>="3.7"
-transformers==4.18.0; python_version<"3.7"
-torch==1.13.1; python_version>="3.7"
-torch==1.10; python_version<"3.7"
-sentencepiece==0.1.95; python_version<"3.10"
-sentencepiece==0.1.97; python_version>="3.10"
---extra-index-url https://pypi.ngc.nvidia.com
-onnx==1.9.0; python_version<"3.8"
-onnx==1.13.1; python_version>="3.8"
-polygraphy>=0.42.2
-tabulate
-toml
-onnx_graphsurgeon
diff --git a/demo/HuggingFace/run.py b/demo/HuggingFace/run.py
deleted file mode 100644
index 3521b57f..00000000
--- a/demo/HuggingFace/run.py
+++ /dev/null
@@ -1,312 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Demonstrates TensorRT capabilities with networks located in HuggingFace repository.
-Requires Python 3.5+
-"""
-
-import os
-import sys
-import pickle
-import argparse
-import importlib
-
-from abc import abstractmethod
-from typing import List
-
-# tabulate
-from tabulate import tabulate
-
-ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
-sys.path.append(ROOT_DIR)
-
-# Wrapper actions supported
-WRAPPER_RUN_ACTION = "run"
-WRAPPER_LIST_ACTION = "list"
-WRAPPER_COMPARE_ACTION = "compare"
-WRAPPER_BENCHMARK_ACTION = "benchmark"
-WRAPPER_ACTIONS = [WRAPPER_RUN_ACTION, WRAPPER_LIST_ACTION, WRAPPER_COMPARE_ACTION, WRAPPER_BENCHMARK_ACTION]
-
-# NNDF
-from NNDF.general_utils import process_per_result_entries, process_results, register_network_folders, RANDOM_SEED
-from NNDF.logger import G_LOGGER
-from NNDF.cuda_bootstrapper import bootstrap_ld_library_path
-
-# huggingface
-from transformers import set_seed
-
-# Force seed to 42 for reproducibility.
-set_seed(RANDOM_SEED)
-
-class Action:
- def __init__(self, networks: List[str], parser: argparse.ArgumentParser):
- self.networks = networks
- self.parser = parser
- self.add_args(self.parser)
-
- @abstractmethod
- def execute(self, args: argparse.Namespace):
- pass
-
- @abstractmethod
- def add_args(self, parser: argparse.ArgumentParser):
- pass
-
-
-class NetworkScriptAction(Action):
-
- # Reserved files names for each network folder
- FRAMEWORKS_SCRIPT_NAME = "frameworks"
- TRT_SCRIPT_NAME = "trt"
- ONNX_SCRIPT_NAME = "onnxrt"
- PER_NETWORK_SCRIPTS = [FRAMEWORKS_SCRIPT_NAME, TRT_SCRIPT_NAME, ONNX_SCRIPT_NAME]
-
- def add_args(self, parser):
- network_group = parser.add_argument_group("specify network")
- network_group.add_argument(
- "network", help="Network to run.", choices=self.networks
- )
-
- def load_script(self, script_name: str, args: argparse.Namespace):
- """Helper for loading a specific script for given network."""
- assert (
- script_name in self.PER_NETWORK_SCRIPTS
- ), "Script must be a reserved name."
-
- # Load the specific commandline script
- return importlib.import_module("{}.{}".format(args.network, script_name))
-
-
-class RunAction(NetworkScriptAction):
- def execute(self, args: argparse.Namespace):
- module = self.load_script(args.script, args)
- module.RUN_CMD._parser = self.parser
-
- old_path = os.getcwd()
- # Execute script in each relevant folder
- try:
- os.chdir(args.network)
- results = module.RUN_CMD()
- finally:
- os.chdir(old_path)
-
- # Output to terminal
- print(results)
-
- # Dump results as a pickle file if applicable.
- # Useful for testing or post-processing.
- if args.save_output_fpath:
- with open(args.save_output_fpath, "wb") as f:
- pickle.dump(results, f)
-
- return 0
-
- def add_args(self, parser: argparse.ArgumentParser):
- super().add_args(parser)
- run_group = parser.add_argument_group("run args")
- run_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
- run_group.add_argument("--save-output-fpath", "-o", default=None, help="Outputs a pickled NetworkResult object. See networks.py for definition.")
-
-
-class BenchmarkAction(NetworkScriptAction):
- def execute(self, args: argparse.Namespace):
- module = self.load_script(args.script, args)
- module.RUN_CMD._parser = self.parser
-
- old_path = os.getcwd()
- # Execute script in each relevant folder
- try:
- os.chdir(args.network)
- results = module.RUN_CMD.run_benchmark()
- finally:
- os.chdir(old_path)
-
- # Output to terminal
- print(results)
-
- return 0
-
- def add_args(self, parser: argparse.ArgumentParser):
- super().add_args(parser)
- run_group = parser.add_argument_group("benchmark args")
- run_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
-
-
-class CompareAction(NetworkScriptAction):
- GENERAL_HEADERS = ["script", "accuracy"]
-
- def execute(self, args: argparse.Namespace):
- compare_group = []
- if args.compare is None:
- compare_group = self.PER_NETWORK_SCRIPTS
- else:
- compare_group = args.compare
-
- if len(compare_group) <= 1:
- G_LOGGER.error(
-                "Comparison command must have at least two groups to compare."
- )
- exit()
-
- results = []
- # Get the parser for inference script which is a superset
- module = None
- try:
- module = self.load_script(self.TRT_SCRIPT_NAME, args)
- except ModuleNotFoundError as e:
- print("Unable to do comparison. TRT script not yet supported.")
- exit(1)
-
- nconfig = module.RUN_CMD.config
- nconfig.MetadataClass.add_inference_args(self.parser)
- self.parser.parse_known_args()
-
- results = []
- # It is possible certain scripts are not implemented
- # Allow the results to generate even if script does not exist.
- modified_compare_group = []
- for g in compare_group:
- cwd = os.getcwd()
- try:
- print()
- print("Collecting Data for {}".format(g))
- os.chdir(args.network)
- module = self.load_script(g, args)
- module.RUN_CMD._parser = self.parser
- results.append(module.RUN_CMD())
- modified_compare_group.append(g)
- except ModuleNotFoundError as e:
- print("{} is not valid, the demo does not support this script yet. Ignoring.".format(g))
-
- finally:
- os.chdir(cwd)
-
- headers, rows = process_per_result_entries(modified_compare_group, results)
- # Rows are grouped by input, flatten to show as one large table
- flattened_rows = [r for input_row in rows.values() for r in input_row]
- print()
- print(tabulate(flattened_rows, headers=headers))
-
- headers, rows = process_results(modified_compare_group, results, nconfig)
- print()
- print(tabulate(rows, headers=headers))
-
- return 0
-
- def add_args(self, parser: argparse.ArgumentParser):
- super().add_args(parser)
- compare_group = parser.add_argument_group("compare args")
- compare_group.add_argument(
- "--compare",
- "-c",
- nargs="+",
- default=None,
- choices=self.PER_NETWORK_SCRIPTS,
- help="Specific frameworks to compare. If none is specified, all are compared.",
- )
-
-
-class ListAction(Action):
- def __init__(self, networks: List[str], parser: argparse.ArgumentParser):
- super().__init__(networks, parser)
- self.networks = networks
-
- def execute(self, args: argparse.Namespace):
- print("Networks that are supported by HuggingFace Demo:")
- [print(n) for n in self.networks]
- return 0
-
-
-def get_action(
- action_name: str, networks: List[str], parser: argparse.ArgumentParser
-) -> Action:
- return {
- WRAPPER_COMPARE_ACTION: CompareAction,
- WRAPPER_LIST_ACTION: ListAction,
- WRAPPER_RUN_ACTION: RunAction,
- WRAPPER_BENCHMARK_ACTION: BenchmarkAction,
- }[action_name](networks, parser)
-
-
-def get_default_parser(
- networks: List[str], description: str = "", add_default_help=False
-) -> argparse.ArgumentParser:
- """
- Returns argparser for use by main(). Allows the ability to toggle default help message with a custom help flag
- so that argparser does not throw SystemExit when --help is passed in. Useful for custom --help functionality.
-
- Returns:
- (argparse.ArgumentParser): argparser used by main()
- """
- # This variable is set so that usage errors don't show up in wrapper
- parser = argparse.ArgumentParser(
- conflict_handler="resolve",
- description=description,
- add_help=add_default_help,
- prog="run.py",
- )
- required_group = parser.add_argument_group("required wrapper arguments")
-
- required_group.add_argument("action", choices=WRAPPER_ACTIONS)
-
- if not add_default_help:
- parser.add_argument(
- "--help",
- "-h",
- help="Shows help message. If --network is supplied, returns help for specific script.",
- action="store_true",
- )
- return parser
-
-
-def verify_python_version():
- if sys.version_info.major < 3 or sys.version_info.minor <= 6:
- raise RuntimeError("HuggingFace OSS Demo does not support Python <= 3.6 due to end-of-life.")
-
-
-def main() -> None:
- """
- Parses network folders and responsible for passing --help flags to subcommands if --network is provided.
- """
- # Verify python version support
- verify_python_version()
-
- # Get all available network scripts
- networks = register_network_folders(os.getcwd())
-
- # Add network folder for entry point
- description = "Runs TensorRT networks that are based-off of HuggingFace variants."
- parser = get_default_parser(networks, description, add_default_help=False)
-
- # Get the general network wrapper help
- known_args, _ = parser.parse_known_args()
-
- # Delegate parser to action specifics
- action = get_action(known_args.action, networks, parser)
- known_args, _ = parser.parse_known_args()
-
- # If bootstrap occurs, then the spawned process completes the rest of demo.
- # We can exit safely. We spawn after parsing basic args to reduce loading churn on rudimentary help commands.
- if bootstrap_ld_library_path():
- sys.exit(0)
-
- return action.execute(known_args)
-
-
-if __name__ == "__main__":
- main()
diff --git a/demo/HuggingFace/tests/test_interface.py b/demo/HuggingFace/tests/test_interface.py
deleted file mode 100644
index 9dda902f..00000000
--- a/demo/HuggingFace/tests/test_interface.py
+++ /dev/null
@@ -1,62 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Tests and verifies our interface objects
-"""
-
-# std
-import os
-import sys
-
-# pytest
-import pytest
-
-# Add library path
-TEST_DIR = os.path.dirname(os.path.abspath(__file__))
-sys.path.append(os.path.join(TEST_DIR, os.pardir))
-
-
-@pytest.fixture(scope="session")
-def inetwork():
- import NNDF.networks as mod
- return mod
-
-
-def test_network_result(inetwork):
- # Test the API by explicit flags
- inetwork.NetworkResult(
- input="example",
- output_tensor=[],
- semantic_output="hello",
- median_runtime=9001,
- models=[],
- )
-
-
-def test_network_checkpoint_result(inetwork):
- inetwork.NetworkCheckpointResult(network_results=[], accuracy=9001.0, perplexity=5.0)
-
-
-def test_precision(inetwork):
- inetwork.Precision(fp16=True)
-
-
-def test_network_metadata(inetwork):
- inetwork.NetworkMetadata(
- variant="gpt2", precision=inetwork.Precision(fp16=True), other=None
- )
diff --git a/demo/NeMo/.gitignore b/demo/NeMo/.gitignore
new file mode 100644
index 00000000..af9bae11
--- /dev/null
+++ b/demo/NeMo/.gitignore
@@ -0,0 +1,5 @@
+apex/
+Megatron-LM/
+NeMo/
+temp/
+__pycache__/
diff --git a/demo/NeMo/GPT3/GPT3ModelConfig.py b/demo/NeMo/GPT3/GPT3ModelConfig.py
new file mode 100644
index 00000000..0e50d6ce
--- /dev/null
+++ b/demo/NeMo/GPT3/GPT3ModelConfig.py
@@ -0,0 +1,87 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Base Class
+import sys
+sys.path.append('../../HuggingFace') # Include HuggingFace directory
+from NNDF.networks import NNConfig, NetworkMetadata
+
+class GPT3ModelTRTConfig(NNConfig):
+
+ NETWORK_FULL_NAME = "full"
+ TARGET_MODELS = [
+ "gpt-126m",
+ "gpt-1.3b",
+ "gpt-5b",
+ ]
+
+ def __init__(
+ self,
+ metadata,
+ **kwargs
+ ):
+ super().__init__(
+ network_name="GPT3",
+ **kwargs
+ )
+ self.nemo_config = None
+ self.use_mask = False
+ self.metadata = metadata
+ self.variant = metadata.variant
+
+ def from_nemo_config(self, nemo_config):
+ self.nemo_config = nemo_config
+
+ def get_metadata_string(self, metadata: NetworkMetadata) -> str:
+ """
+ Serializes a Metadata object into string.
+        The string is checked to ensure it is friendly to filenames across Windows and Linux operating systems.
+ This function is a modified version from HuggingFace/NNDF/networks.py.
+
+ returns:
+ string: -[-]*-
+ """
+
+ enabled_precisions = self.nemo_config.trt_export_options
+ precision_str = "-".join(
+ [
+ k for k, v in {
+ "fp8": enabled_precisions.use_fp8,
+ "fp16": enabled_precisions.use_fp16,
+ "bf16": enabled_precisions.use_bf16,
+ }.items() if v
+ ]
+ )
+
+ result = [self.network_name, metadata.variant]
+ if precision_str:
+ result.append(precision_str)
+
+ # Append max sequence length
+ result.append("ms" + str(self.nemo_config.model.max_seq_len))
+
+ if metadata.use_cache:
+ result.append("kv_cache")
+
+ final_str = "-".join(result)
+ assert self._is_valid_filename(
+ final_str
+ ), "Metadata for current network {} is not filename friendly: {}.".format(
+ self.network_name, final_str
+ )
+
+ return final_str
diff --git a/demo/NeMo/GPT3/decoding.py b/demo/NeMo/GPT3/decoding.py
new file mode 100644
index 00000000..2edf66e7
--- /dev/null
+++ b/demo/NeMo/GPT3/decoding.py
@@ -0,0 +1,453 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from collections.abc import Iterable
+import sys
+from typing import List
+
+from apex.transformer.pipeline_parallel.utils import _reconfigure_microbatch_calculator
+from megatron.core import parallel_state
+from nemo.collections.nlp.modules.common.text_generation_strategy import GPTModelTextGenerationStrategy
+from nemo.utils import AppState
+import torch
+import torch.nn.functional as F
+
+from GPT3.trt_utils import GPTTRTDecoder
+
+sys.path.append('../../HuggingFace') # Include HuggingFace
+from NNDF.logger import G_LOGGER
+
+
+def sample_sequence_batch(
+ model,
+ inference_strategy,
+ context_tokens,
+ context_lengths,
+ tokens_to_generate,
+ all_probs=False,
+ temperature=None,
+ extra={},
+):
+ def repetition_penalty(logits, repetition_penalty, used_tokens):
+ """ Implement the repetition penalty, check paper
+ https://arxiv.org/pdf/1909.05858.pdf
+ """
+ if used_tokens is not None and repetition_penalty != 1.0:
+ logits_update = torch.gather(logits, 1, used_tokens)
+ logits = torch.scatter(logits, 1, used_tokens, logits_update / repetition_penalty)
+ return logits
+
+ def top_k_logits(logits, top_k=0, top_p=0.0, filter_value=-float('Inf'), started=None):
+ """
+ This function has been mostly taken from huggingface conversational
+ ai code at
+ https://medium.com/huggingface/how-to-build-a-state-of-the-art-
+ conversational-ai-with-transfer-learning-2d818ac26313
+
+ @param logits: logits tensor
+ @param top_k: keep only top k tokens with highest probability
+ @param top_p: keep the top tokens with cumulative probability
+ @filter_value: value to set filtered tokens to
+ @started: a tensor of bools indicating whether the text generation starts for the batch
+ returns the filtered logits
+ """
+ if top_k > 0:
+ # Remove all tokens with a probability less than the
+ # last token of the top-k
+ indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
+ if started is not None:
+ for i in torch.arange(indices_to_remove.size(0))[started]:
+ logits[i, indices_to_remove[i]] = filter_value
+ else:
+ logits[indices_to_remove] = filter_value
+
+ if top_p > 0.0:
+            # Convert to 1D
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+
+ # Remove tokens with cumulative probability above the threshold
+ sorted_indices_to_remove = cumulative_probs > top_p
+ # Shift the indices to the right to keep also the first token
+ # above the threshold
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+ sorted_indices_to_remove[..., 0] = 0
+ if started is not None:
+ for i in torch.arange(sorted_indices.size(0))[started]:
+ indices_to_remove = sorted_indices[i][sorted_indices_to_remove[i]]
+ logits[i, indices_to_remove] = filter_value
+ else:
+ for i in range(sorted_indices.size(0)):
+ indices_to_remove = sorted_indices[i][sorted_indices_to_remove[i]]
+ logits[i, indices_to_remove] = filter_value
+
+ return logits
+
+ app_state = AppState()
+ batch_size = context_tokens.shape[0]
+ if not (hasattr(model, "trt") or hasattr(model, "onnx")):
+ _reconfigure_microbatch_calculator(
+ rank=app_state.global_rank,
+ rampup_batch_size=None,
+ global_batch_size=batch_size,
+ micro_batch_size=batch_size,
+ data_parallel_size=1,
+ )
+
+ tokenizer = model.tokenizer
+ # initialize the batch
+ with torch.no_grad():
+ context_length = context_lengths.min().item()
+ context_lengths_cpu = context_lengths.cpu()
+ inference_strategy.init_batch(context_tokens, context_length)
+ # added eos_id to support the function generate_samples_eval that passes
+        # eos_id as an argument and needs termination when that id is found.
+ eod_id = tokenizer.eos_id
+ counter = 0
+
+ tokens = context_tokens
+ output_logits = None
+ all_generated_indices = None # used to track all generated indices
+ # Generate enough tokens for the longest sequence
+ maxlen = tokens_to_generate + context_lengths.max().item()
+ maxlen = inference_strategy.clip_max_len(maxlen)
+
+ is_done = torch.zeros([batch_size]).byte()
+ lengths = torch.ones([batch_size]).long() * maxlen
+
+ use_cache = extra.get("use_cache", False)
+ is_onnx = hasattr(model, "onnx")
+ is_trt = hasattr(model, "trt")
+
+ if is_trt:
+ assert isinstance(model.trt, GPTTRTDecoder)
+ input_ids_name = model.trt.get_input_ids_name()
+ input_ids_type = model.trt.get_torch_type(input_ids_name)
+ position_ids_name = model.trt.get_position_ids_name()
+ position_ids_type = model.trt.get_torch_type(position_ids_name)
+ attention_mask_name = model.trt.get_attention_mask_name()
+            if attention_mask_name is not None:
+ attention_mask_type = model.trt.get_torch_type(attention_mask_name)
+
+ position_ids = inference_strategy.position_ids
+ attention_mask = inference_strategy.attention_mask
+
+ torch.cuda.nvtx.range_pop() # "Prepare Batch"
+ while context_length < maxlen:
+ torch.cuda.nvtx.range_push("I/O Setup")
+
+ output = None
+ if is_onnx and use_cache:
+ G_LOGGER.warn(f"ONNX runtime path does not support KV-cache.")
+
+ # Modify counter based on using cache or not.
+ if is_trt:
+ # TRT input preprocessing doesn't use nemo function
+ pass
+ elif not is_onnx and use_cache:
+ batch, tensor_shape = inference_strategy.prepare_batch_at_step(
+ tokens, maxlen, batch_size, counter, context_length
+ )
+ else:
+ batch, tensor_shape = inference_strategy.prepare_batch_at_step(
+ tokens, maxlen, batch_size, 0, context_length # step is always 0
+ )
+
+ # inputs input_ids: [BS, SEQ], position_ids: [BS, SEQ], attention_mask: [1, 1, SEQ, SEQ]
+ if is_trt:
+ context_mode = (use_cache and counter == 0) or not use_cache
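+                # With KV cache enabled, only the first iteration runs in context mode over the
+                # full prompt; later iterations feed just the newest token and reuse the cached keys/values.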
+ if context_mode or not use_cache:
+ # context mode
+ batch_tokens = tokens[:, :context_length]
+ batch_position_ids = position_ids[:, :context_length]
+ else:
+ # generate mode
+ batch_tokens = tokens[:, context_length - 1].view(batch_size, -1)
+ batch_position_ids = position_ids[:, context_length - 1].view(batch_size, -1)
+ seq_len = batch_tokens.shape[1]
+ batch_attention_mask = attention_mask[0:1, 0:1, :seq_len, :seq_len]
+ input_ids = batch_tokens.type(input_ids_type).contiguous().cuda()
+ tensor_dict = {input_ids_name : (input_ids.data_ptr(), input_ids.shape)}
+                if position_ids_name is not None:
+ batch_position_ids = batch_position_ids.type(position_ids_type).contiguous().cuda()
+ tensor_dict[position_ids_name] = (batch_position_ids.data_ptr(), batch_position_ids.shape)
+                if attention_mask_name is not None:
+ batch_attention_mask = batch_attention_mask.type(attention_mask_type).contiguous().cuda()
+ tensor_dict[attention_mask_name] = (batch_attention_mask.data_ptr(), batch_attention_mask.shape)
+
+ logits_name = model.trt.get_output_name()
+ torch.cuda.nvtx.range_pop() # "I/O Setup"
+ output = model.trt.run(logits_name, tensor_dict, seq_len, context_mode)
+
+ elif is_onnx:
+ assert len(batch) == 5, "Length of batch must be 5."
+ (
+ batch_tokens,
+ attention_mask,
+ position_ids,
+ set_inference_key_value_memory,
+ _,
+ ) = batch
+ seq_len = batch_tokens.shape[1]
+ attention_mask = attention_mask[0:1, 0:1, 0:seq_len, 0:seq_len]
+
+ from onnxruntime import InferenceSession
+ assert isinstance(model.onnxrt, InferenceSession)
+                # Currently only ONNX Runtime on CPU is supported
+ # Our fp8 models don't currently use a user-provided attention_mask
+ tensor_dict = {'input_ids': batch_tokens.cpu().detach().numpy(),
+ 'position_ids': position_ids.cpu().detach().numpy()}
+
+ def have_attention_mask(sess):
+                    return any(inp.name == 'attention_mask' for inp in sess.get_inputs())
+
+ if have_attention_mask(model.onnxrt):
+ tensor_dict['attention_mask'] = attention_mask.cpu().detach().numpy()
+ torch.cuda.nvtx.range_pop() # "I/O Setup"
+ output = model.onnxrt.run(['logits'], tensor_dict)[0]
+ output = torch.Tensor(output).cuda()
+ # output logits: [BS, SEQ, 50304]
+ else:
+ # nemo path
+ torch.cuda.nvtx.range_pop() # "I/O Setup"
+ output = inference_strategy.forward_step(batch, tensor_shape)
+ output = output[0]['logits'].float()
+
+ assert output is not None
+ torch.cuda.nvtx.range_push("Output Sampling")
+ output = output.float()
+ logits = output[:, -1].view(batch_size, -1).contiguous()
+
+ # make sure it will generate at least min_length
+ min_length = extra.get('min_tokens_to_generate', 0)
+ if min_length > 0:
+ within_min_length = (context_length - context_lengths) < min_length
+ logits[within_min_length, eod_id] = -float('Inf')
+
+ # make sure it won't sample outside the vocab_size range
+ logits[:, tokenizer.vocab_size :] = -float('Inf')
+
+ # started indicates whether the current token step passes the context_length, so we make sure not to overwrite the context tokens
+ started = context_lengths_cpu <= context_length
+ if extra.get('greedy', False):
+ prev = torch.argmax(logits, dim=-1).view(-1)
+ else:
+ logits = logits.float()
+ logits /= temperature
+                # handle repetition penalty
+ logits = repetition_penalty(logits, extra.get('repetition_penalty', 1.0), all_generated_indices)
+ logits = top_k_logits(
+ logits, top_k=extra.get('top_k', 0), top_p=extra.get('top_p', 0.9), started=started
+ )
+ probs = F.softmax(logits, dim=-1)
+ prev = torch.multinomial(probs, num_samples=1).view(-1)
+
+ prev = prev.cpu()
+ # Clamp the predicted out of vocabulary tokens
+ prev = torch.clamp(prev, max=tokenizer.vocab_size - 1)
+ # Replace sampled tokens w/ done token if EOD has already been sampled
+ new_tokens = torch.where(is_done, eod_id, prev)
+ # post process the inference tokens based on the strategy
+ inference_strategy.post_process(tokens, new_tokens, context_length)
+
+ # Insert either new predicted or next prompt token
+ if extra.get("accuracy_mode", False):
+ # We only update the last token for accuracy mode.
+ at_prediction_index = (context_lengths + tokens_to_generate - 1 == context_length)
+ tokens[:, context_length] = torch.where(at_prediction_index, new_tokens.cuda(), tokens[:, context_length])
+ else:
+ tokens[:, context_length] = torch.where(started.cuda(), new_tokens.cuda(), tokens[:, context_length])
+
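+            # Track per-token log-probabilities and generated indices for later perplexity/accuracy
+            # evaluation; benchmark mode skips this bookkeeping to measure raw generation speed.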
+ if not extra.get("benchmark_mode", False):
+ if output_logits is None:
+ output = F.log_softmax(output[:, :context_length, :], 2)
+ indices = torch.unsqueeze(tokens[:, 1 : context_length + 1], 2)
+ output_logits = torch.gather(output, 2, indices).squeeze(2)
+ all_generated_indices = indices[:, :, 0]
+ if all_probs:
+ full_logits = output
+ else:
+ output = F.log_softmax(output, 2)
+ indices = torch.unsqueeze(new_tokens.cuda(), 1).unsqueeze(2)
+ new_output_logits = torch.gather(output, 2, indices).squeeze(2)
+
+ # This copy can be optimized out by pre-allocating the memory.
+ output_logits = torch.cat([output_logits, new_output_logits], 1)
+ all_generated_indices = torch.cat([all_generated_indices, indices[:, :, 0]], 1)
+ if all_probs:
+ if extra.get("use_cache", False):
+ full_logits = torch.cat([full_logits, output], 1)
+ else:
+ full_logits = output
+
+ done_token = (prev == eod_id)
+ done_token = done_token.byte() & started.byte()
+
+ just_finished = (done_token & ~is_done).bool()
+ lengths[just_finished.view(-1)] = context_length
+ is_done = is_done | done_token
+
+ done = torch.all(is_done)
+ torch.cuda.nvtx.range_pop() # "Output Sampling"
+
+ context_length += 1
+ counter += 1
+ if done and not extra.get("benchmark_mode", False):
+ break
+
+ if all_probs:
+ return tokens, context_length, lengths, output_logits, full_logits
+ return tokens, context_length, lengths, output_logits, None
+
+def initialize_ddp(model, cfg):
+ # check whether the DDP is initialized
+ if cfg.runtime == "nemo" and parallel_state.is_unitialized():
+ def dummy():
+ return
+ if model.trainer.strategy.launcher is not None:
+ model.trainer.strategy.launcher.launch(dummy, trainer=model.trainer)
+ model.trainer.strategy.setup_environment()
+
+ if model.cfg.get('transformer_engine', False):
+ model.setup_transformer_engine_tp_groups()
+
+def get_special_tokens(tokenizer):
+ special_tokens = set()
+ if hasattr(tokenizer, 'pad_token') and tokenizer.pad_token is not None:
+ special_tokens.add(tokenizer.pad_token)
+ if hasattr(tokenizer, 'eos_token') and tokenizer.eos_token is not None:
+ special_tokens.add(tokenizer.eos_token)
+ if hasattr(tokenizer, 'bos_token') and tokenizer.bos_token is not None:
+ special_tokens.add(tokenizer.bos_token)
+ if hasattr(tokenizer, 'cls_token') and tokenizer.cls_token is not None:
+ special_tokens.add(tokenizer.cls_token)
+ if hasattr(tokenizer, 'unk_token') and tokenizer.unk_token is not None:
+ special_tokens.add(tokenizer.unk_token)
+ if hasattr(tokenizer, 'sep_token') and tokenizer.sep_token is not None:
+ special_tokens.add(tokenizer.sep_token)
+ if hasattr(tokenizer, 'mask_token') and tokenizer.mask_token is not None:
+ special_tokens.add(tokenizer.mask_token)
+ return special_tokens
+
+def process_output(model, output, return_segments=False):
+ torch.cuda.nvtx.range_push("Process Output")
+ inference_strategy = GPTModelTextGenerationStrategy(model)
+ tokenizer = model.tokenizer
+ if output is not None:
+ decode_tokens, output_logits, full_logits = output
+ decode_tokens = decode_tokens.cpu().numpy().tolist()
+
+ # convert ids to text by applying tokenizer
+ resp_sentences = list(map(tokenizer.ids_to_text, decode_tokens))
+
+ all_offsets = []
+ resp_sentences_seg = []
+ if return_segments:
+ # segments sentences into words.
+ for decode_token in decode_tokens:
+ words = []
+ for token in decode_token:
+ if not isinstance(token, Iterable):
+ token = [token]
+ word = tokenizer.ids_to_tokens(token)
+ if isinstance(word, Iterable):
+ word = word[0]
+ if hasattr(tokenizer.tokenizer, 'byte_decoder'):
+ word = bytearray([tokenizer.tokenizer.byte_decoder[c] for c in word]).decode(
+ 'utf-8', errors='replace'
+ )
+ words.append(word)
+ resp_sentences_seg.append(words)
+
+ # offsets calculation
+ special_tokens = get_special_tokens(tokenizer)
+ for item in resp_sentences_seg:
+ offsets = [0]
+ for index, token in enumerate(item):
+ if index != len(item) - 1:
+ if token in special_tokens:
+ offsets.append(offsets[-1])
+ else:
+ offsets.append(len(token) + offsets[-1])
+ all_offsets.append(offsets)
+
+ output = {}
+ output['sentences'] = resp_sentences
+ output['tokens'] = resp_sentences_seg
+ output['logprob'] = output_logits
+ output['full_logprob'] = full_logits
+ output['token_ids'] = decode_tokens
+ output['offsets'] = all_offsets
+ output = inference_strategy.post_generation_process(output)
+ torch.cuda.nvtx.range_pop() # "Process Output"
+ return output
+
+def generate(model, inputs, cfg):
+ torch.cuda.nvtx.range_push("Prepare Batch")
+ initialize_ddp(model, cfg)
+
+ tokens_to_generate = cfg.inference.tokens_to_generate
+ min_tokens_to_generate = cfg.inference.min_tokens_to_generate
+ add_BOS = cfg.inference.add_BOS
+ all_probs = cfg.inference.all_probs
+ temperature = cfg.inference.temperature
+    is_benchmark_mode = cfg.mode == "benchmark"
+    is_accuracy_mode = cfg.mode == "accuracy"
+
+ inference_strategy = GPTModelTextGenerationStrategy(model)
+ if isinstance(inputs, tuple):
+ context_tokens_tensor, context_length_tensor = inputs
+ else:
+ context_tokens_tensor, context_length_tensor = inference_strategy.tokenize_batch(
+ inputs, tokens_to_generate, add_BOS
+ )
+
+ context_length = context_length_tensor.min().item()
+
+ batch_token_result = sample_sequence_batch(
+ model,
+ inference_strategy,
+ context_tokens_tensor,
+ context_length_tensor,
+ tokens_to_generate,
+ all_probs,
+ temperature=temperature,
+ extra={
+ "top_p": cfg.inference.top_p,
+ "top_k": cfg.inference.top_k,
+ "greedy": cfg.inference.greedy,
+ "repetition_penalty": cfg.inference.repetition_penalty,
+ "min_tokens_to_generate": min_tokens_to_generate,
+ "use_cache": cfg.use_cache,
+ "benchmark_mode": is_benchmark_mode,
+ "accuracy_mode": is_accuracy_mode,
+ "use_fp8_storage": cfg.onnx_export_options.use_fp8_storage,
+ },
+ )
+
+ tokens, context_length, _, output_logits, full_logits = batch_token_result
+
+ output = None
+ if tokens is not None:
+ output = tokens[:, :context_length], output_logits, full_logits
+ return output
+
+def full_inference(model, inputs, cfg):
+ output = generate(model, inputs, cfg)
+ if output is not None:
+        output = process_output(model, output, return_segments=(cfg.mode != "benchmark"))
+ return output
diff --git a/demo/NeMo/GPT3/frameworks.py b/demo/NeMo/GPT3/frameworks.py
new file mode 100644
index 00000000..851f4cdf
--- /dev/null
+++ b/demo/NeMo/GPT3/frameworks.py
@@ -0,0 +1,81 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import sys
+
+import omegaconf
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from GPT3.nemo_utils import load_nemo_model
+from GPT3.GPT3ModelConfig import GPT3ModelTRTConfig
+from interface import NeMoCommand
+
+sys.path.append('../../HuggingFace') # Include HuggingFace
+from NNDF.interface import FRAMEWORK_NATIVE
+from NNDF.networks import (
+ NetworkModel,
+ NetworkModels,
+)
+
+class GPT3NeMoTorch(NeMoCommand):
+ def __init__(
+ self,
+ nemo_cfg,
+ config_class=GPT3ModelTRTConfig,
+ description="Runs framework results for GPT3 model with NeMo.",
+ **kwargs
+ ):
+ super().__init__(nemo_cfg, config_class, description, model_classes=None, **kwargs)
+ self.framework_name = FRAMEWORK_NATIVE
+
+ def setup_tokenizer_and_model(self):
+ self.nemo_cfg.runtime = 'nemo'
+ self.model = load_nemo_model(self.nemo_cfg)
+ self.tokenizer = self.model.tokenizer
+
+ torch_models = [
+ NetworkModel(
+ name=GPT3ModelTRTConfig.NETWORK_FULL_NAME, fpath=self.workspace.torch_path
+ )
+ ]
+ return NetworkModels(torch=torch_models, onnx=None, trt=None)
+
+ def process_framework_specific_arguments(self, onnx_model: str = None, **kwargs):
+ if onnx_model:
+ raise RuntimeError(
+ "native framework does not support loading an ONNX file via `onnx-model` yet. Please specify the NeMo model using `nemo-model` instead."
+ )
+
+
+# Entry point
+def getGPT3NeMoTorch():
+ config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "../config.yaml")
+ nemo_cfg = omegaconf.OmegaConf.load(config_path)
+ return GPT3NeMoTorch(nemo_cfg)
+
+# Entry point
+RUN_CMD = getGPT3NeMoTorch()
+
+if __name__ == "__main__":
+ result = RUN_CMD()
+ print("Results: {}".format(result))
diff --git a/demo/NeMo/GPT3/lambada_dataset.py b/demo/NeMo/GPT3/lambada_dataset.py
new file mode 100644
index 00000000..a7945cec
--- /dev/null
+++ b/demo/NeMo/GPT3/lambada_dataset.py
@@ -0,0 +1,126 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import collections
+import json
+import requests
+import sys
+import torch
+from torch.nn.utils.rnn import pad_sequence
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from nemo_export import create_dir_if_not_exist
+
+__all__ = ['Lambada']
+
+
+class Lambada():
+
+ def __init__(self, base_dir, tokens_to_generate, padding = -1, max_length = 2048):
+ assert tokens_to_generate >= 1
+ assert padding == -1 or tokens_to_generate == 1
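+        # Fixed padding is only supported when a single token is generated (the LAMBADA last-word setting).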
+ self.base_dir = base_dir
+ self.tokens_to_generate = tokens_to_generate
+ self.padding = padding
+ self.max_length = max_length
+ self.download()
+
+ def get_data_file_path(self):
+ path = os.path.join(self.base_dir, "lambada")
+ path = os.path.join(path, "lambada_test.jsonl")
+ create_dir_if_not_exist(path)
+ return path
+
+ def download(self):
+ path = self.get_data_file_path()
+ if not os.path.exists(path):
+ url = "https://github.com/cybertronai/bflm/raw/master/lambada_test.jsonl"
+ with requests.get(url) as r, open(path, 'wb') as fh:
+ fh.write(r.content)
+
+ def load(self):
+ path = self.get_data_file_path()
+ with open(path) as fh:
+ for line in fh:
+ yield json.loads(line)
+
+ def _preprocess(self, text):
+ text = text.replace("“", '"')
+ text = text.replace("”", '"')
+ text = text.replace("’", "'")
+ text = text.replace("‘", "'")
+ return text
+
+ def doc_to_text(self, doc):
+ return "\n" + self._preprocess(doc["text"].rsplit(" ", 1)[0]).strip()
+
+ def doc_to_target(self, doc):
+ split_text = doc["text"].rsplit(" ", 1)
+ if len(split_text) <= 1:
+ raise ValueError(f"Input doc '{doc}' does not have target.")
+ return " " + self._preprocess(split_text[1])
+
+ def preprocess_input(self, tokenizer, docs):
+ _Input = collections.namedtuple("_DS_Input", ["inputs", "inp_enc", "lens", "lens_pad", "conti_len"])
+ batch_size = len(docs)
+ tokens = []
+ conti_lens = []
+ lens = []
+ inp_encs = []
+ for doc in docs:
+ # Handle padded text
+ if not doc["text"]:
+ inp_enc = [0]
+ conti_len = 0
+ else:
+ text = self.doc_to_text(doc)
+ target = self.doc_to_target(doc)
+
+ context_enc = tokenizer.text_to_ids(text)
+ continuation_enc = tokenizer.text_to_ids(target)
+
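+                # Keep only the trailing (max_length + 1) tokens of context + continuation so the
+                # shifted labels still fit; conti_len records how many continuation tokens are scored.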
+ inp_enc = (context_enc + continuation_enc)[-(self.max_length + 1) :]
+ conti_len = len(continuation_enc)
+
+ inp_encs.append(inp_enc)
+ conti_lens.append(conti_len)
+ tokens.append(torch.tensor(inp_enc))
+ lens.append(len(inp_enc) - 1)
+ max_lens = max(lens)
+
+ tokens_pad = pad_sequence(tokens, batch_first=False, padding_value=tokenizer.eos_id)
+ if self.padding != -1 and max_lens % self.padding != 0:
+            # We need to align the context length to a multiple of 8 for FP8 runs using the NeMo framework.
+ extra_pad_len = self.padding - (max_lens % self.padding)
+
+ extra_pad = torch.ones(extra_pad_len, batch_size) * tokenizer.eos_id
+ extra_pad = extra_pad.type_as(tokens_pad)
+ inp_enc_pad = torch.vstack((tokens_pad, extra_pad)).T
+
+ lens_pad = max_lens + extra_pad_len
+ else:
+ inp_enc_pad = tokens_pad.T
+ lens_pad = max_lens + 1 - self.tokens_to_generate
+
+ inputs = (torch.tensor(inp_enc_pad).cuda(), (torch.ones(batch_size, dtype=torch.int32) * lens_pad).cuda())
+ return _Input(inputs=inputs, inp_enc=inp_encs, lens=lens, lens_pad=lens_pad, conti_len=conti_lens)
+
diff --git a/demo/NeMo/GPT3/nemo_utils.py b/demo/NeMo/GPT3/nemo_utils.py
new file mode 100644
index 00000000..f6d5bca7
--- /dev/null
+++ b/demo/NeMo/GPT3/nemo_utils.py
@@ -0,0 +1,161 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import gc
+import os
+import sys
+
+# Only print out error messages from NeMo
+from nemo.utils.nemo_logging import Logger as NG_LOGGER
+nemo_logger = NG_LOGGER(False)
+nemo_logger.setLevel(nemo_logger.ERROR)
+
+from nemo.utils.app_state import AppState
+from nemo.utils.model_utils import inject_model_parallel_rank
+from nemo.collections.nlp.modules.common.megatron.megatron_init import fake_initialize_model_parallel
+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
+from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector
+from omegaconf import OmegaConf, open_dict
+from pytorch_lightning.trainer.trainer import Trainer
+import torch
+
+sys.path.append('../../HuggingFace') # Include HuggingFace directory.
+from NNDF.logger import G_LOGGER
+
+
+def get_computeprob_response(tokenizer, response, inputs):
+ """
+ This function is a modified version from:
+ https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/modules/common/text_generation_utils.py#L139
+
+ So parallel state does not need to be initialized before calling this function.
+ """
+ compute_prob_response = {}
+ new_token_ids = []
+ new_tokens = []
+ new_texts = []
+ log_probs = []
+ full_logprobs = []
+ offsets = []
+ for batch_id in range(len(response['tokens'])):
+ if isinstance(inputs, (list, tuple)):
+ if isinstance(inputs[0], str):
+ new_token_id = tokenizer.text_to_ids(inputs[batch_id])
+ new_text = inputs[batch_id]
+ token_len = len(new_token_id)
+ elif isinstance(inputs[0], torch.Tensor):
+ token_len = int(inputs[1][batch_id].item())
+ new_token_id = inputs[0][batch_id][:token_len].tolist()
+ new_text = tokenizer.ids_to_text(new_token_id)
+ new_token_ids.append(new_token_id)
+ new_tokens.append(response['tokens'][batch_id][:token_len])
+ new_texts.append(new_text)
+ log_probs.append(response['logprob'][batch_id][:token_len])
+ full_logprobs.append(response['full_logprob'][batch_id][:token_len])
+ offsets.append(response['offsets'][batch_id][:-1])
+ compute_prob_response['sentences'] = new_texts
+ compute_prob_response['tokens'] = new_tokens
+ compute_prob_response['token_ids'] = new_token_ids
+ compute_prob_response['logprob'] = log_probs
+ compute_prob_response['full_logprob'] = full_logprobs
+ compute_prob_response['offsets'] = offsets
+ return compute_prob_response
+
+
+def load_nemo_model(cfg, model_class=MegatronGPTModel):
+ # Trainer is required for restoring model parallel models
+ trainer = Trainer(strategy=NLPDDPStrategy(), **cfg.trainer)
+
+ if cfg.gpt_model_file and cfg.checkpoint_dir:
+        raise ValueError("NeMo model and checkpoint cannot both be set.")
+
+ if cfg.gpt_model_file:
+ save_restore_connector = NLPSaveRestoreConnector()
+ if os.path.isdir(cfg.gpt_model_file):
+ save_restore_connector.model_extracted_dir = cfg.gpt_model_file
+
+ pretrained_cfg = MegatronGPTModel.restore_from(
+ restore_path=cfg.gpt_model_file,
+ trainer=trainer,
+ return_config=True,
+ save_restore_connector=save_restore_connector,
+ )
+ OmegaConf.set_struct(pretrained_cfg, True)
+ with open_dict(pretrained_cfg):
+ pretrained_cfg.sequence_parallel = False
+ pretrained_cfg.activations_checkpoint_granularity = None
+ pretrained_cfg.activations_checkpoint_method = None
+ pretrained_cfg.precision = trainer.precision
+ if trainer.precision == "16":
+ pretrained_cfg.megatron_amp_O2 = False
+ model = model_class.restore_from(
+ restore_path=cfg.gpt_model_file,
+ trainer=trainer,
+ override_config_path=pretrained_cfg,
+ save_restore_connector=save_restore_connector,
+ )
+ G_LOGGER.info(f"{type(model)} has been successfully restored from {cfg.gpt_model_file}")
+ elif cfg.checkpoint_dir:
+        checkpoint_file = os.path.join(cfg.checkpoint_dir, cfg.checkpoint_name)
+ if not os.path.exists(checkpoint_file):
+ raise ValueError(f"File {checkpoint_file} does not exist.")
+
+ app_state = AppState()
+ if cfg.tensor_model_parallel_size > 1 or cfg.pipeline_model_parallel_size > 1:
+ app_state.model_parallel_size = cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size
+ app_state.tensor_model_parallel_size = cfg.tensor_model_parallel_size
+ app_state.pipeline_model_parallel_size = cfg.pipeline_model_parallel_size
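+            # fake_initialize_model_parallel computes the parallel ranks without creating real
+            # process groups, so the matching sharded checkpoint can be located from a single process.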
+ (
+ app_state.tensor_model_parallel_rank,
+ app_state.pipeline_model_parallel_rank,
+ app_state.model_parallel_size,
+ app_state.data_parallel_size,
+ app_state.pipeline_model_parallel_split_rank,
+ app_state.virtual_pipeline_model_parallel_rank,
+ ) = fake_initialize_model_parallel(
+ world_size=app_state.model_parallel_size,
+ rank=trainer.global_rank,
+ tensor_model_parallel_size_=cfg.tensor_model_parallel_size,
+ pipeline_model_parallel_size_=cfg.pipeline_model_parallel_size,
+ pipeline_model_parallel_split_rank_=cfg.pipeline_model_parallel_split_rank,
+ )
+ checkpoint_path = inject_model_parallel_rank(checkpoint_file)
+ model = model_class.load_from_checkpoint(checkpoint_path, hparams_file=cfg.hparams_file, trainer=trainer)
+ G_LOGGER.info(f"{type(model)} has been successfully restored from checkpoint {checkpoint_path}")
+ else:
+ raise ValueError("Need to provide a nemo gpt model through config file.")
+
+ model.freeze()
+
+ # Have to turn off activations_checkpoint_method for inference
+ try:
+ model.model.language_model.encoder.activations_checkpoint_method = None
+ except AttributeError:
+ pass
+
+ model.eval()
+ G_LOGGER.debug(f"Model configuration: {model.cfg}")
+ G_LOGGER.debug(f"Vocabulary size: {model.tokenizer.vocab_size}")
+ return model.cuda()
+
+def release_nemo_model(model):
+    print("Releasing NeMo model.")
+ model.model.cpu()
+ del model.model
+ gc.collect()
+ torch.cuda.empty_cache()
+ model.model = None
diff --git a/demo/NeMo/GPT3/onnxrt.py b/demo/NeMo/GPT3/onnxrt.py
new file mode 100644
index 00000000..78bd0aca
--- /dev/null
+++ b/demo/NeMo/GPT3/onnxrt.py
@@ -0,0 +1,112 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import sys
+
+import onnxruntime as ort
+import onnx
+import omegaconf
+from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer
+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from interface import NeMoCommand, BaseModel
+from nemo_export import NeMoConverter
+from GPT3.GPT3ModelConfig import GPT3ModelTRTConfig
+
+sys.path.append('../../HuggingFace') # Include HuggingFace
+from NNDF.interface import FRAMEWORK_ONNXRT
+from NNDF.logger import G_LOGGER
+from NNDF.networks import (
+ NetworkModel,
+ NetworkModels,
+)
+
+class GPT3NeMoOnnxRT(NeMoCommand):
+ def __init__(
+ self,
+ nemo_cfg,
+ config_class=GPT3ModelTRTConfig,
+ description="Runs ONNX Runtime results for GPT3 model.",
+ **kwargs
+ ):
+ super().__init__(nemo_cfg, config_class, description, model_classes=None, **kwargs)
+ self.framework_name = FRAMEWORK_ONNXRT
+
+
+ def load_onnx_model(self):
+ G_LOGGER.info(f'Loading ONNX model from {self.nemo_cfg.onnx_model_file}')
+
+ def get_opset_version(name : str) -> int:
+            """Returns the opset version of the ONNX model at `name`.
+
+            `model` here is local in scope, so Python's garbage collector
+            reclaims it without manual memory management via `del`.
+ """
+ model = onnx.load(name, load_external_data=False)
+ return model.opset_import[0].version
+
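+        # The NeMo exporter used in this demo produces opset 17 ONNX models; fail early on anything else.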
+ assert get_opset_version(self.nemo_cfg.onnx_model_file) == 17
+ return ort.InferenceSession(self.nemo_cfg.onnx_model_file)
+
+
+ def setup_tokenizer_and_model(self):
+ self.nemo_cfg.runtime = 'onnx'
+ self.model = BaseModel()
+ self.model.cfg = self.nemo_cfg.model
+ self.model.tokenizer = get_tokenizer(tokenizer_name='megatron-gpt-345m', vocab_file=None, merges_file=None)
+
+ if not self.nemo_cfg.onnx_model_file:
+ self.nemo_cfg.onnx_model_file = os.path.join(
+ self.workspace.dpath,
+ f"onnx/model-{self.nemo_cfg.trainer.precision}.onnx",
+ )
+
+ converter = NeMoConverter(self.nemo_cfg, MegatronGPTModel)
+ if not os.path.isfile(self.nemo_cfg.onnx_model_file):
+ # Convert NeMo model to ONNX model
+ onnx_name = converter.nemo_to_onnx()
+ self.nemo_cfg.onnx_model_file = onnx_name
+
+ # The ONNX model is in opset17 by default.
+ self.model.onnxrt = self.load_onnx_model()
+ self.tokenizer = self.model.tokenizer
+ onnx_models = [
+ NetworkModel(
+ name=GPT3ModelTRTConfig.NETWORK_FULL_NAME, fpath=self.nemo_cfg.onnx_model_file,
+ )
+ ]
+ return NetworkModels(torch=None, onnx=onnx_models, trt=None)
+
+# Entry point
+def getGPT3NeMoOnnxRT():
+ config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "../config.yaml")
+ nemo_cfg = omegaconf.OmegaConf.load(config_path)
+ return GPT3NeMoOnnxRT(nemo_cfg)
+
+# Entry point
+RUN_CMD = getGPT3NeMoOnnxRT()
+
+if __name__ == "__main__":
+ result = RUN_CMD()
+ print("Results: {}".format(result))
diff --git a/demo/NeMo/GPT3/sequence_perplexity.py b/demo/NeMo/GPT3/sequence_perplexity.py
new file mode 100644
index 00000000..9fc9ef29
--- /dev/null
+++ b/demo/NeMo/GPT3/sequence_perplexity.py
@@ -0,0 +1,76 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import math
+import numpy as np
+import torch
+
+__all__ = ['SequencePerplexity']
+
+class SequencePerplexity():
+ def __init__(self, topN):
+ super().__init__()
+ self.ppls = []
+ self.sequence_ppls = []
+ self.topN_equals = [0] * len(topN)
+ self.topN = topN
+
+ def update(self, ds_input, response, tokenizer):
+ for batch, tokens in enumerate(response['token_ids']):
+ inp_len = ds_input.lens[batch]
+ if inp_len == 0:
+ continue
+
+ conti_len = ds_input.conti_len[batch]
+
+ response_token_ids = tokens[:inp_len]
+ assert response_token_ids == ds_input.inp_enc[batch][:-1], f"Mismatch in input tokens."
+ full_log_probs = response['full_logprob'][batch][:inp_len]
+
+ # calculate ppl with whole sequence.
+ label = torch.tensor([ds_input.inp_enc[batch][1:]]).cuda()
+ log_probs = full_log_probs.unsqueeze(0).permute((0, 2, 1))
+ ppl = torch.nn.CrossEntropyLoss()(log_probs, label)
+ self.sequence_ppls.append(ppl.cpu())
+
+ # calculate topN.
+ log_probs = full_log_probs[-conti_len:]
+ conti_token_ids = ds_input.inp_enc[batch][-conti_len:]
+ conti_tokens = tokenizer.ids_to_tokens(conti_token_ids)
+
+ for index, topN in enumerate(self.topN):
+ if conti_token_ids[0] in log_probs.topk(topN, dim=-1).indices:
+ self.topN_equals[index] += 1
+
+ # calculate ppl with last token.
+ log_probs = log_probs.cpu().to(torch.float32)
+ conti_enc = torch.tensor(tokenizer.tokens_to_ids(conti_tokens))
+ conti_probs = torch.gather(log_probs, 1, conti_enc.unsqueeze(-1)).squeeze(-1)
+
+ ppl = float(conti_probs.sum())
+ self.ppls.append(ppl)
+
+ def compute(self):
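+        # self.ppls holds the summed log-probabilities of each continuation, so exp(-mean) is the
+        # aggregate last-token perplexity; self.sequence_ppls holds per-sequence cross-entropy,
+        # so exp(mean) is the full-sequence perplexity.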
+ ppls = math.exp(-np.mean(np.array(self.ppls)))
+ sequence_ppls = math.exp(np.mean(np.array(self.sequence_ppls)))
+ acc = [equals / len(self.ppls) for equals in self.topN_equals]
+ txt = []
+ for i, j in zip(self.topN, acc):
+ txt.append("acc(top{}): {:.4f}".format(i, j))
+ acc_text = ", ".join(txt)
+ return ppls, sequence_ppls, acc, acc_text
+
diff --git a/demo/NeMo/GPT3/trt.py b/demo/NeMo/GPT3/trt.py
new file mode 100644
index 00000000..189c1ba3
--- /dev/null
+++ b/demo/NeMo/GPT3/trt.py
@@ -0,0 +1,236 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import sys
+
+import omegaconf
+from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer
+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from nemo_export import NeMoConverter, create_dir_if_not_exist
+from GPT3.GPT3ModelConfig import GPT3ModelTRTConfig
+from GPT3.trt_utils import load_trt_model
+from interface import NeMoCommand, BaseModel
+import onnx
+
+sys.path.append('../../HuggingFace') # Include HuggingFace
+from NNDF.interface import FRAMEWORK_TENSORRT
+from NNDF.logger import G_LOGGER
+from NNDF.models import _log_fake_perf_metrics
+from NNDF.networks import (
+ NetworkModel,
+ NetworkModels,
+)
+
+class GPT3NeMoTRT(NeMoCommand):
+ def __init__(
+ self,
+ nemo_cfg,
+ config_class=GPT3ModelTRTConfig,
+ description="Runs TensorRT results for GPT3 model.",
+ **kwargs
+ ):
+ super().__init__(nemo_cfg, config_class, description, model_classes=None, **kwargs)
+ self.framework_name = FRAMEWORK_TENSORRT
+
+
+ def setup_tokenizer_and_model(self):
+ self.nemo_cfg.runtime = 'trt'
+ self.model = BaseModel()
+ self.model.cfg = self.nemo_cfg.model
+ self.model.tokenizer = get_tokenizer(tokenizer_name='megatron-gpt-345m', vocab_file=None, merges_file=None)
+
+ # Path to write new onnx models if need arises. Prevents overwrite of
+ # user-provided onnx files in case opset_version needs to be upgraded
+ # to 19 or onnx files with kv-cache needs to be written.
+ onnx_workpath = os.path.join(
+ self.workspace.dpath,
+ "onnx",
+ )
+ if self.nemo_cfg.onnx_model_file:
+ # Input by user, can be a read-only location.
+ onnx_name = self.nemo_cfg.onnx_model_file
+ else:
+ onnx_name = os.path.join(
+ onnx_workpath,
+ f"model-{self.nemo_cfg.trainer.precision}.onnx",
+ )
+ self.nemo_cfg.onnx_model_file = onnx_name
+ self.nemo_cfg.trt_export_options.timing_cache = self.timing_cache
+
+ converter = NeMoConverter(self.nemo_cfg, MegatronGPTModel)
+ if not os.path.isfile(onnx_name):
+ # Convert NeMo model to ONNX model
+ onnx_name = converter.nemo_to_onnx()
+
+ def get_opset_version(name : str) -> int:
+            """Returns the opset version of the ONNX model at `name`.
+
+            `model` here is local in scope, so Python's garbage collector
+            reclaims it without manual memory management via `del`.
+ """
+ model = onnx.load(name, load_external_data=False)
+ return model.opset_import[0].version
+
+ opset_version = get_opset_version(onnx_name)
+ if opset_version < 19:
+ opset19_onnx_name = NeMoConverter.get_opset19_onnx_fpath(
+ onnx_name, onnx_workpath
+ )
+ if not os.path.isfile(opset19_onnx_name):
+ opset19_onnx_name = NeMoConverter.onnx_to_opset19(
+ onnx_name, onnx_workpath
+ )
+
+            if opset19_onnx_name is not None:
+ onnx_name = opset19_onnx_name
+
+ # Add KV cache to ONNX model
+ kv_output_policy = "kv_new"
+
+ converter = NeMoConverter(self.nemo_cfg)
+
+ def has_kv_cache_support(
+ model_name: str, match_names=("key", "value", "kv")
+ ) -> bool:
+ """To detect onnx models with kv_cache exported, input node names
+ contain match_names.
+ """
+ model = onnx.load(model_name, load_external_data=False)
+
+ # Get network inputs.
+ input_all = [node.name for node in model.graph.input]
+ input_initializer = [node.name for node in model.graph.initializer]
+ net_input_names = list(set(input_all) - set(input_initializer))
+
+ kv_nodes = filter(
+ lambda name: any(map(lambda match: match in name, match_names)),
+ net_input_names,
+ )
+ return any(kv_nodes) and len(net_input_names) > 2
+
+ if (not self.nemo_cfg.use_cache) and (has_kv_cache_support(onnx_name)):
+ raise RuntimeError(
+ "ONNX model has been exported with kv-cache enabled, but "
+ "runtime configuration has kv-cache disabled. Consider "
+ "enabling kv-cache support via the `use-cache` option."
+ )
+
+ if self.nemo_cfg.use_cache and (not has_kv_cache_support(onnx_name)):
+ G_LOGGER.info(f"Converting {onnx_name} with KV-cache support")
+ new_dir = onnx_workpath + f"_{kv_output_policy}"
+ if self.nemo_cfg.onnx_export_options.use_fp8_storage:
+ new_dir += f"_fp8_storage"
+ onnx_output_fpath = os.path.join(new_dir, onnx_name.split("/")[-1])
+
+ if not os.path.isfile(onnx_output_fpath):
+ create_dir_if_not_exist(onnx_output_fpath)
+ converter.create_onnx(onnx_name, onnx_output_fpath, kv_output_policy)
+ onnx_name = onnx_output_fpath
+
+ if self.nemo_cfg.onnx_export_options.prune:
+ onnx_name = converter.prune_onnx(onnx_name)
+
+ # Convert ONNX model to TRT engine
+ self.nemo_cfg.trt_export_options.use_strongly_typed = self.use_strongly_typed
+ self.nemo_cfg.trt_export_options.timing_cache = self.timing_cache
+ self.nemo_cfg.trt_export_options.opt_seq_len = self.opt_seq_len
+
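+        # Compose the engine file name from the build-affecting options, e.g.
+        # "trt-bs1-opt128-kv.plan" for batch size 1, opt_seq_len 128 and kv-cache enabled.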
+ suffixes = []
+ suffixes.append("bs" + str(self.nemo_cfg.batch_size))
+ if self.nemo_cfg.trt_export_options.opt_seq_len != None:
+ suffixes.append("opt" + str(self.nemo_cfg.trt_export_options.opt_seq_len))
+ if self.nemo_cfg.use_cache:
+ suffixes.append("kv")
+ if self.nemo_cfg.onnx_export_options.use_fp8_storage:
+ suffixes.append("fp8_storage")
+ if self.nemo_cfg.trt_export_options.sparse:
+ suffixes.append("sp")
+ if not self.nemo_cfg.trt_export_options.use_strongly_typed:
+ suffixes.append("no_strongly_typed")
+ suffix = "-".join(suffixes)
+ trt_fpath = os.path.join(self.workspace.dpath, f"trt-{suffix}.plan")
+
+ if os.path.isfile(trt_fpath):
+ G_LOGGER.debug(f"TRT Engine plan exists at location {trt_fpath}.")
+ _log_fake_perf_metrics()
+ else:
+ converter.onnx_to_trt(onnx_name, trt_fpath)
+
+ self.nemo_cfg.trt_engine_file = trt_fpath
+ self.model.trt = load_trt_model(self.nemo_cfg)
+ self.tokenizer = self.model.tokenizer
+ onnx_models = [
+ NetworkModel(
+ name=GPT3ModelTRTConfig.NETWORK_FULL_NAME, fpath=self.nemo_cfg.onnx_model_file,
+ )
+ ]
+ return NetworkModels(torch=None, onnx=onnx_models, trt=None)
+
+ def add_args(self):
+ super().add_args()
+ engine_group = self._parser.add_argument_group("trt engine")
+ engine_group.add_argument(
+ "--opt-seq-len",
+ default=None,
+ help="Set optimized input sequence length to be used in engine building",
+ type=int,
+ )
+ engine_group.add_argument(
+ "--no-timing-cache",
+ default=False,
+            help="Disable the timing cache used to speed up engine building",
+ action="store_true",
+ )
+ engine_group.add_argument(
+ "--no-strongly-typed",
+ default=False,
+ help="Disable strongly typed mode in engine building",
+ action="store_true",
+ )
+
+ def process_framework_specific_arguments(
+ self,
+ opt_seq_len: int = None,
+ no_timing_cache: bool = False,
+ no_strongly_typed: bool = False,
+ **kwargs
+ ):
+ self.opt_seq_len = opt_seq_len
+ self.use_timing_cache = not no_timing_cache
+ self.use_strongly_typed = not no_strongly_typed
+ self.timing_cache = self.workspace.get_timing_cache() if self.use_timing_cache else None
+
+# Entry point
+def getGPT3NeMoTRT():
+ config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "../config.yaml")
+ nemo_cfg = omegaconf.OmegaConf.load(config_path)
+ return GPT3NeMoTRT(nemo_cfg)
+
+# Entry point
+RUN_CMD = getGPT3NeMoTRT()
+
+if __name__ == "__main__":
+ result = RUN_CMD()
+ print("Results: {}".format(result))
diff --git a/demo/NeMo/GPT3/trt_utils.py b/demo/NeMo/GPT3/trt_utils.py
new file mode 100644
index 00000000..a146cf7e
--- /dev/null
+++ b/demo/NeMo/GPT3/trt_utils.py
@@ -0,0 +1,231 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import sys
+
+import numpy as np
+import tensorrt as trt
+import torch
+from transformers.configuration_utils import PretrainedConfig
+
+sys.path.append('../../HuggingFace') # Include HuggingFace directory
+from NNDF.models import TRTEngineFile
+from NNDF.networks import NetworkMetadata
+from NNDF.tensorrt_utils import TRTNativeRunner
+from NNDF.logger import G_LOGGER
+from Seq2Seq.export import DecoderTRTEngine
+
+from HuggingFace.NNDF.tensorrt_utils import TRTNativeRunner, CUASSERT
+from cuda import cudart
+
+
+class GPTTRTDecoder(TRTNativeRunner):
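+    """TensorRT runner for the NeMo GPT decoder.
+
+    Wraps a built TRT engine and, when kv-cache is enabled, also manages the
+    per-layer key/value buffers and a separate execution context for the
+    context (first decoding) phase.
+    """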
+
+ INPUT_IDS_INDEX = 0
+ POSITION_IDS_INDEX = 1
+ ATTENTION_MASK_INDEX = 2
+
+ def __init__(
+ self,
+ trt_engine_file: TRTEngineFile,
+ use_cache: bool,
+ use_fp8_storage: bool,
+ cfg,
+ network_metadata: NetworkMetadata = None,
+ hf_config: PretrainedConfig = None,
+ ):
+ super().__init__(trt_engine_file, network_metadata, hf_config)
+ self.use_cache = use_cache
+ self.use_fp8_storage = use_fp8_storage
+ if self.use_cache:
+ self._set_context_mode_trt_context()
+ self.io_names = set()
+ self.input_tensor_names = set()
+ for i in range(self.trt_engine.num_io_tensors):
+ tensor_name = self.trt_engine.get_tensor_name(i)
+ self.io_names.add(tensor_name)
+ if self.trt_engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
+ self.input_tensor_names.add(tensor_name)
+
+ self.cfg = cfg
+ logits_size = self.cfg.batch_size * self.cfg.model.max_seq_len * self.cfg.model.vocab_size
+
+ self.batch_size = self.cfg.batch_size
+ self.max_seq_len = self.cfg.model.max_seq_len
+ self.num_layers = self.cfg.model.num_layers
+ self.nb_heads = self.cfg.model.nb_heads
+ self.head_size = self.cfg.model.head_size
+
+ dtype = self.get_torch_type(self.get_output_name())
+ self.logits = torch.zeros(logits_size, dtype=dtype).contiguous().cuda()
+
+
+ self.init_kv_cache()
+ self.past_decoder_length = 0
+
+        # Set the next input shape while the GPU kernel is executing.
+        # Use a dict to record which input shapes have changed.
+ self.input_shape_change_record = dict()
+
+ def init_kv_cache(self):
+ # kv cache buffer
+ self.attention_kv_cache_buffer = dict()
+ cache_dtype = torch.float16
+ if self.use_fp8_storage:
+ cache_dtype = torch.uint8
+ for i in range(self.num_layers):
+ for code in ["key", "value"]:
+ attention_kv_cache_name = self.make_kv_cache_name(i, code)
+ self.attention_kv_cache_buffer[attention_kv_cache_name] = torch.empty(
+ self.max_seq_len,
+ self.batch_size,
+ self.nb_heads,
+ self.head_size,
+ dtype=cache_dtype,
+ device=torch.cuda.current_device(),
+ ).contiguous().cuda()
+
+
+ def make_kv_cache_name(self, layer, code):
+ return f"key_values.{layer}.decoder.{code}"
+
+ def _set_context_mode_trt_context(self):
+ # Create TRT context for context mode (1st decoder run) with optimization profile index = 1
+ self.context_trt_context = self.trt_engine.create_execution_context()
+ self.context_trt_context.set_optimization_profile_async(1, self.stream)
+
+ def get_torch_type(self, name):
+ trt_type = self.trt_engine.get_tensor_dtype(name)
+ mapping = {
+ trt.float32: torch.float32,
+ trt.float16: torch.float16,
+ trt.int8: torch.int8,
+ trt.int32: torch.int32,
+ trt.int64: torch.int64,
+ trt.bool: torch.bool,
+ trt.uint8: torch.uint8,
+ trt.bfloat16: torch.bfloat16,
+ }
+ if trt_type in mapping:
+ return mapping[trt_type]
+ raise ValueError(f"Got unexpected tensorrt dtype {trt_type} in get_torch_type().")
+
+ def get_input_ids_name(self):
+ return self.trt_engine.get_tensor_name(self.INPUT_IDS_INDEX)
+
+ def has_position_ids(self):
+        # If the input at POSITION_IDS_INDEX is 2-dimensional, assume it is position_ids.
+ return len(self.trt_engine.get_tensor_shape(self.trt_engine.get_tensor_name(self.POSITION_IDS_INDEX))) == 2
+
+ def get_position_ids_name(self):
+ if self.has_position_ids():
+ return self.trt_engine.get_tensor_name(self.POSITION_IDS_INDEX)
+ else:
+ return None
+
+ def get_output_name(self):
+ return "logits"
+
+ def has_attention_mask(self):
+ if self.ATTENTION_MASK_INDEX < self.trt_engine.num_io_tensors:
+ return self.trt_engine.get_tensor_name(self.ATTENTION_MASK_INDEX) == "attention_mask"
+ return False
+
+ def get_attention_mask_name(self):
+ if self.has_attention_mask():
+ return self.trt_engine.get_tensor_name(self.ATTENTION_MASK_INDEX)
+ return None
+
+ def run(self, output_name, io_descs, seq_len, context_mode=False):
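+        """Execute one engine run.
+
+        `io_descs` maps tensor names to (device address, shape) pairs; when
+        kv-cache is enabled, the past/new kv-cache bindings are added to it
+        here before execution. Returns the logits tensor viewed with the
+        engine's output shape.
+        """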
+ torch.cuda.nvtx.range_push("TRT Setup")
+ if self.use_cache:
+ if context_mode:
+ self.past_decoder_length = 0
+ else:
+ # When kv-cache is used, seq_len is always 1 in Generation phase.
+ seq_len = 1
+ cur_shape = (self.past_decoder_length, self.batch_size, self.nb_heads, self.head_size)
+ new_shape = (seq_len, self.batch_size, self.nb_heads, self.head_size)
+ assert self.past_decoder_length + seq_len < self.max_seq_len
+ offset = self.batch_size*self.nb_heads*self.head_size*self.past_decoder_length
+ for i in range(self.num_layers):
+ for code in ["key", "value"]:
+ attention_kv_cache_name = self.make_kv_cache_name(i, code)
+ cur_address = self.attention_kv_cache_buffer[attention_kv_cache_name].data_ptr()
+                    # The new kv address starts at the end of the past kv-cache data.
+ io_descs[f"past_{attention_kv_cache_name}"] = (cur_address, cur_shape)
+ new_address = cur_address + offset*self.attention_kv_cache_buffer[attention_kv_cache_name].element_size()
+ modifier = ""
+ if self.use_fp8_storage:
+ modifier = "_qfp8"
+ new_kv_name = f"new_{attention_kv_cache_name}{modifier}"
+ io_descs[new_kv_name] = (new_address, new_shape)
+ self.past_decoder_length += seq_len
+ else:
+ self.past_decoder_length = 0
+ # Set active optimization profile and active execution context.
+ self.trt_context.set_optimization_profile_async(self.profile_idx, self.stream)
+ active_context = self.trt_context
+ if context_mode and self.use_cache:
+ active_context = self.context_trt_context
+
+ # Set up input bindings.
+ for name, tensor_shape in io_descs.items():
+ active_context.set_tensor_address(name, tensor_shape[0])
+ if name in self.input_tensor_names:
+ if name in self.input_shape_change_record and \
+ self.input_shape_change_record[name][0] == active_context and \
+ self.input_shape_change_record[name][1] == tensor_shape[1]:
+ continue
+ else:
+ active_context.set_input_shape(name, tensor_shape[1])
+ elif self.use_cache:
+ pass
+ else:
+ assert False, "All tensors must be inputs for non-KV mode"
+ assert active_context.all_shape_inputs_specified
+
+ # Set up output bindings.
+ assert output_name == self.get_output_name()
+ engine_out_torch_type = self.get_torch_type(output_name)
+ if self.logits.dtype != engine_out_torch_type:
+ raise ValueError(f"Output data type does not match, {self.logits.dtype} vs. {engine_out_torch_type}.")
+ shape = active_context.get_tensor_shape(output_name)
+ active_context.set_tensor_address(output_name, self.logits.data_ptr())
+
+
+ # Execute inference.
+ torch.cuda.nvtx.range_pop() # "TRT Setup"
+ active_context.execute_async_v3(self.stream)
+ if not context_mode and self.use_cache:
+ self.input_shape_change_record.clear()
+ for i in range(self.num_layers):
+ for code in ["key", "value"]:
+ next_past_shape = (self.past_decoder_length, self.batch_size, self.nb_heads, self.head_size)
+ attention_kv_cache_name = self.make_kv_cache_name(i, code)
+                    # Set the next iteration's input shape while the CPU is idle.
+ active_context.set_input_shape(f"past_{attention_kv_cache_name}", next_past_shape)
+ self.input_shape_change_record[f"past_{attention_kv_cache_name}"] = [active_context, next_past_shape]
+ CUASSERT(cudart.cudaStreamSynchronize(self.stream))
+ if len(shape) != 3:
+ raise ValueError("Output must have a dimension of 3.")
+ output = self.logits[:shape[0] * shape[1] * shape[2]].view(tuple(shape))
+ return output
+
+def load_trt_model(cfg):
+ G_LOGGER.info(f'Loading TensorRT engine from {cfg.trt_engine_file} with use_cache={cfg.use_cache}, use_fp8_storage={cfg.onnx_export_options.use_fp8_storage} ')
+ trt_engine_file = DecoderTRTEngine(cfg.trt_engine_file)
+ return GPTTRTDecoder(trt_engine_file, cfg.use_cache, cfg.onnx_export_options.use_fp8_storage, cfg)
diff --git a/demo/NeMo/README.md b/demo/NeMo/README.md
new file mode 100644
index 00000000..44f183dd
--- /dev/null
+++ b/demo/NeMo/README.md
@@ -0,0 +1,156 @@
+# TensorRT FP8 Inference for NeMo models
+**Deprecation:** For all users using TensorRT to accelerate Large Language Model inference, please use [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/). TensorRT-LLM covers the full model range and functionality of the HuggingFace and NeMo demos, and adds further optimizations and features (e.g. model quantization, in-flight batching), multi-GPU support, broader model coverage, and much better inference performance. The HuggingFace and NeMo demos will no longer be maintained, and they will be removed from OSS in the TRT 10.0 release.
+
+This repository demonstrates TensorRT inference with NeMo Megatron models in FP8/FP16/BF16 precision.
+
+Currently, this repository supports [NeMo GPT](https://huggingface.co/nvidia/nemo-megatron-gpt-5B/tree/fp8) models only.
+
+# Environment Setup
+It's recommended to run inside a container to avoid conflicts when installing dependencies. Please check out [`NGC TensorRT`](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt/tags) and find a container with TensorRT 9.0 or above. A GPU with compute capability 8.9 or above is required to run the demo with FP8 precision.
+
+```
+# Run inside a TensorRT container
+sh install.sh [--deps <directory>] [-j <jobs>] [--ninja]
+```
+
+All arguments are optional. `--deps` indicates the relative dependency download directory, `-j` indicates the number of parallel jobs for building, and `--ninja` installs the `ninja` build system, which can speed up installation. See `sh install.sh --help` for more details on the arguments.
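+
+For example, a typical invocation inside the container might look like the following (the dependency directory and job count below are just illustrative values):
+
+```
+sh install.sh --deps temp -j 8 --ninja
+```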
+
+> The script will install required dependencies and it can take around 30 minutes or more.
+
+**Please note that the [HuggingFace demo directory](demo/HuggingFace) needs to be visible when running this demo, so utility functions can be correctly imported.**
+
+# File Structure
+This demo follows a similar structure and command-line interface to the [HuggingFace demo](/demo/HuggingFace).
+```
+.
+├── GPT3 # GPT3 directory
+│ ├── GPT3ModelConfig.py # model configuration and variant-specific parameters
+│ ├── frameworks.py # NeMo PyTorch inference script
+│ ├── onnxrt.py # OnnxRT inference script
+│ ├── trt.py # TensorRT inference script
+│ ├── decoding.py # main inference logic for all runtimes
+│ └── ... # files with utility functions for model export and inference
+├── config.yaml # full configuration for model export and inference
+├── interface.py # definitions of setup functions
+├── nemo_export.py # export functions for NeMo model -> ONNX model -> TRT engine
+└── run.py # main entry script
+```
+
+# Overview
+
+This demo contains two scripts `run.py` and `nemo_export.py`. Script `run.py` accepts a NeMo model or an ONNX model as input, and performs end-to-end inference with various actions specified by the user. Script `nemo_export.py` accepts a NeMo model or an ONNX model as input, and exports the input to an ONNX model or a TensorRT engine.
+
+# How to run inference
+The `run` action will run end-to-end inference on sentences specified in [config.yaml](/demo/NeMo/config.yaml). A model, a variant, and precision are required to run this command.
+```
+python3 run.py run GPT3 --variant gpt-5b --working-dir $(pwd)/temp --fp8 --bf16 --nemo-model=
+```
+
+Expected output for the second sentence:
+```
+Batch 1: {'sentences': ['TensorRT is a Deep Learning compiler used for deep learning. It is a compiler for TensorFlow, CNTK, and Torch. It is a compiler for the TensorFlow, CNTK,'],
+ 'tokens': [['<|endoftext|>', 'T', 'ensor', 'RT', ' is', ' a', ' Deep', ' Learning', ' compiler', ' used', ' for', ' deep', ' learning', '.', ' It', ' is', ' a', ' compiler', ' for', ' T', 'ensor', 'Flow', ',', ' C', 'NT', 'K', ',', ' and', ' Torch', '.', ' It', ' is', ' a', ' compiler', ' for', ' the', ' T', 'ensor', 'Flow', ',', ' C', 'NT', 'K', ',']],
+ 'logprob': tensor([[-4.6415e+00, -6.9270e+00, -7.4458e+00, -1.9856e+00, -5.9787e-01,
+ -8.1058e+00, -7.9629e-02, -5.8013e+00, -5.5222e+00, -1.4401e+00,
+ -5.5644e+00, -3.3747e-01, -3.3463e+00, -1.1306e+00, -1.3685e+00,
+ -1.7793e+00, -2.8960e+00, -1.4127e+00, -2.3209e+00, -7.3454e-04,
+ -9.8682e-02, -1.3268e+00, -2.1373e+00, -3.9281e-01, -6.5222e-04,
+ -2.9425e-01, -1.4167e+00, -1.8416e+00, -9.2462e-01, -1.4805e+00,
+ -1.4299e+00, -2.0632e+00, -2.9947e+00, -9.1487e-01, -2.6651e+00,
+ -2.2772e+00, -4.7057e-03, -2.2852e-01, -2.4777e+00, -2.4731e-01,
+ -7.0602e-03, -4.7339e-04, -1.1645e-01]], device='cuda:0'),
+ 'full_logprob': None,
+ 'token_ids': [[50256, 51, 22854, 14181, 318, 257, 10766, 18252, 17050, 973, 329, 2769, 4673, 13, 632, 318, 257, 17050, 329, 309, 22854, 37535, 11, 327, 11251, 42, 11, 290, 34868, 13, 632, 318, 257, 17050, 329, 262, 309, 22854, 37535, 11, 327, 11251, 42, 11]],
+ 'offsets': [[0, 0, 1, 6, 8, 11, 13, 18, 27, 36, 41, 45, 50, 59, 60, 63, 66, 68, 77, 81, 83, 88, 92, 93, 95, 97, 98, 99, 103, 109, 110, 113, 116, 118, 127, 131, 135, 137, 142, 146, 147, 149, 151, 152]]}
+```
+
+# How to run with various configurations
+- FP8, FP16, and BF16 precisions are supported, and they can be set through `--fp8`, `--fp16`, and `--bf16` respectively. Currently, the script has constraints on how precisions are specified, and supported combinations are:
+ 1. Pure FP16: `--fp16` (default)
+ 2. Pure BF16: `--bf16`
+ 3. FP8-FP16: `--fp8 --fp16`
+ 4. FP8-BF16: `--fp8 --bf16`
+
+- `--nemo-model=` or `--nemo-checkpoint=` can be used to load a NeMo model or checkpoint from a specified path, respectively. If these arguments are not provided, a NeMo model will be downloaded (and cached/re-used for subsequent runs) in the working directory.
+
+- K-V cache can be enabled through `--use-cache`
+
+- Batch size can be changed through `--batch-size=`
+
+- The default max sequence length is `256` and can be changed through `--max-seq-len=`; a combined example is shown below.
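+
+For example, a run combining several of the options above might look like this (the `--nemo-model` path is a placeholder):
+
+```
+python3 run.py run GPT3 --variant gpt-5b --working-dir $(pwd)/temp --fp8 --bf16 --use-cache --batch-size=2 --max-seq-len=512 --nemo-model=<path/to/model.nemo>
+```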
+
+# How to run performance benchmark
+The `benchmark` action will run inference with specified input and output sequence lengths multiple times.
+```
+python3 run.py benchmark GPT3 --variant gpt-5b --working-dir $(pwd)/temp --fp8 --bf16 --nemo-model= --batch-size=16 --input-seq-len=128 --output-seq-len=20 --use-cache --warmup=10 --iterations=100
+```
+
+Expected output for `trt`:
+```
+***************************
+Running 100 iterations with batch size: 16, input sequence length: 128 and output sequence length: 20
+[E2E inference] Total Time: 11.55453 s, Average Time: 0.11555 s, 95th Percentile Time: 0.11581 s, 99th Percentile Time: 0.11587 s, Throughput: 2769.48 tokens/s
+[Without tokenizer] Total Time: 10.44539 s, Average Time: 0.10445 s, 95th Percentile Time: 0.10459 s, 99th Percentile Time: 0.10465 s, Throughput: 3063.55 tokens/s
+***************************
+```
+
+Expected output for `frameworks`:
+```
+***************************
+Running 100 iterations with batch size: 16, input sequence length: 128 and output sequence length: 20
+[E2E inference] Total Time: 55.23503 s, Average Time: 0.55235 s, 95th Percentile Time: 0.55525 s, 99th Percentile Time: 0.56992 s, Throughput: 579.34 tokens/s
+[Without tokenizer] Total Time: 54.06591 s, Average Time: 0.54066 s, 95th Percentile Time: 0.54369 s, 99th Percentile Time: 0.55839 s, Throughput: 591.87 tokens/s
+***************************
+```
+
+# How to run accuracy check
+The `accuracy` action will run an accuracy check on a dataset. The default is the [LAMBADA](https://paperswithcode.com/dataset/lambada) dataset.
+```
+python3 run.py accuracy GPT3 --variant gpt-5b --working-dir $(pwd)/temp --fp8 --bf16 --nemo-model= --use-cache
+```
+
+Expected output for `trt`:
+```
+***************************
+Lambada ppl(last token): 4.4756, ppl(sequence): 18.3254, acc(top1): 0.6722, acc(top3): 0.8597, acc(top5): 0.9076
+***************************
+```
+
+Expected output for `frameworks`:
+```
+***************************
+Lambada ppl(last token): 4.4669, ppl(sequence): 18.3161, acc(top1): 0.6765, acc(top3): 0.8612, acc(top5): 0.9082
+***************************
+```
+
+# How to export a NeMo model to ONNX
+NeMo to ONNX conversion consists of 3 steps:
+1. Export ONNX from NeMo.
+2. NeMo uses TransformerEngine to export FP8 models to ONNX (step 1) and the exported ONNX has custom TensorRT Q/DQ nodes. Script `convert_te_onnx_to_trt_onnx.py` can be used to convert the custom operators into standard opset19 ONNX Q/DQ nodes.
+3. Add KV-cache inputs and outputs to the exported ONNX, so it is faster when performing inference on the model.
+
+`nemo_export.py` has the `--opset19` and `--use-cache` options to decide whether to perform steps 2 and 3, respectively:
+```
+python3 nemo_export.py --nemo-model=model.nemo --onnx=onnx/model.onnx --opset19 --use-cache
+```
+`--extra-configs` can be used to specify configs that are defined in `config.yaml` but not exposed through the existing command-line interface.
+Please specify `--help` to see more options.
+
+
+# How to run sparsity for benchmark
+
+*Note: this is for performance analysis only. The pruned model should not be used for accuracy purposes unless it was fine-tuned for sparsity. Pruning may take minutes or hours depending on the model size.*
+
+
+1. Enable sparsity knobs in `config.yaml` (a reference snippet is shown at the end of this section):
+ * Set `onnx_export_options.prune` to `True` to enable pruning of the ONNX model.
+ * Set `trt_export_options.sparse` to `True` to enable sparse tactics profiling in TensorRT.
+2. Run the scripts. You should see logs like the ones below.
+
+```
+[2023-07-28 00:15:03,015][OSS][INFO] Prune ONNX model with: polygraphy surgeon prune ${OSS_ROOT}/demo/NeMo/temp/gpt-5b/GPT3-gpt-5b-fp8-fp16-ms256/onnx/model-16.opset19.onnx -o ${OSS_ROOT}/demo/NeMo/temp/gpt-5b/GPT3-gpt-5b-fp8-fp16-ms256/onnx/pruned.model-16.opset19.onnx --save-external-data ${OSS_ROOT}/demo/NeMo/temp/gpt-5b/GPT3-gpt-5b-fp8-fp16-ms256/onnx/pruned.model-16.opset19.onnx_data
+[2023-07-28 00:15:03,016][OSS][INFO] This may take a while...
+...
+
+[2023-07-28 03:36:52,307][OSS][DEBUG] trtexec --onnx=${OSS_ROOT}/demo/NeMo/temp/gpt-5b/GPT3-gpt-5b-fp8-fp16-ms256/onnx/pruned.model-16.opset19.onnx --minShapes=input_ids:1x1,position_ids:1x1 --optShapes=input_ids:1x128,position_ids:1x128 --maxShapes=input_ids:1x256,position_ids:1x256 --fp8 --fp16 --sparsity=enable --timingCacheFile=functional.cache
+```
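+
+For reference, a minimal sketch of the relevant `config.yaml` knobs with both options enabled looks like this (all other keys keep their defaults):
+
+```
+onnx_export_options:
+  prune: True   # prune the ONNX model for the Sparse Tensor Cores 2:4 pattern
+trt_export_options:
+  sparse: True  # enable sparse tactics profiling in TensorRT
+```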
diff --git a/demo/NeMo/apex.patch b/demo/NeMo/apex.patch
new file mode 100644
index 00000000..daa1b615
--- /dev/null
+++ b/demo/NeMo/apex.patch
@@ -0,0 +1,29 @@
+diff --git a/setup.py b/setup.py
+index cb1a790..949f877 100644
+--- a/setup.py
++++ b/setup.py
+@@ -29,15 +29,15 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
+ print("\nCompiling cuda extensions with")
+ print(raw_output + "from " + cuda_dir + "/bin\n")
+
+- if (bare_metal_version != torch_binary_version):
+- raise RuntimeError(
+- "Cuda extensions are being compiled with a version of Cuda that does "
+- "not match the version used to compile Pytorch binaries. "
+- "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
+- + "In some cases, a minor-version mismatch will not cause later errors: "
+- "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
+- "You can try commenting out this check (at your own risk)."
+- )
++ # if (bare_metal_version != torch_binary_version):
++ # raise RuntimeError(
++ # "Cuda extensions are being compiled with a version of Cuda that does "
++ # "not match the version used to compile Pytorch binaries. "
++ # "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
++ # + "In some cases, a minor-version mismatch will not cause later errors: "
++ # "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
++ # "You can try commenting out this check (at your own risk)."
++ # )
+
+
+ def raise_if_cuda_home_none(global_option: str) -> None:
diff --git a/demo/NeMo/config.yaml b/demo/NeMo/config.yaml
new file mode 100644
index 00000000..2b1888bb
--- /dev/null
+++ b/demo/NeMo/config.yaml
@@ -0,0 +1,87 @@
+runtime: null
+gpt_model_file: null # GPT nemo file path
+onnx_model_file: null # ONNX file path
+trt_engine_file: null # TRT engine file path
+
+# Parameters for loading from a checkpoint
+checkpoint_dir: null # Path to a folder that contains a .ckpt file
+checkpoint_name: null # Name of the .ckpt file within the checkpoint_dir.
+hparams_file: null # Path to a .yaml file that contains the hyperparameters of the checkpoint.
+
+batch_size: 1
+use_cache: True
+use_one_input: False # export ONNX model with only one input
+prompts: # prompts for GPT inference
+ - "How are you?"
+ - "TensorRT is a Deep Learning compiler used for deep learning."
+
+mode: 'inference' # Could change to accuracy or benchmark
+
+inference:
+  greedy: True # Whether to use greedy decoding instead of sampling
+ top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering.
+ top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+ temperature: 1.0 # sampling temperature
+  add_BOS: True # add the bos token at the beginning of the prompt
+ tokens_to_generate: 30 # The maximum length of the sequence to be generated.
+  all_probs: False # whether to return the log prob for all the tokens in the vocab
+ repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty.
+ min_tokens_to_generate: 0 # The minimum length of the sequence to be generated.
+ compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False
+ seed: 1234
+
+accuracy:
+ dataset: Lambada
+ metric: Perplexity
+ top_n: 1,3,5
+ tokens_to_generate: 5
+
+benchmark:
+ input_seq_len: 20
+ output_seq_len: 20
+
+# for nemo to onnx export
+onnx_export_options:
+ runtime_check: False
+ verbose: False
+ onnx_opset: 17
+ do_constant_folding: True
+ cache_support: False
+ prune: False # Prune the ONNX model for Sparse Tensor Cores 2:4 pattern
+ device: 'cuda'
+ check_tolerance: 0.01
+ use_fp8_storage: False
+ quantize_bmms: False
+
+# for onnx to trt export
+trt_export_options:
+ opt_seq_len: 128 # define the optimized sequence length
+ use_tf32: True
+ use_fp16: False
+ use_fp8: False
+ use_bf16: False
+  use_strongly_typed: True # enabling strongly typed mode invalidates the `use_[fp8|fp16|bf16]` flags.
+  sparse: False # enable sparse tactics in the TRT engine builder
+ timing_cache: 'functional.cache'
+
+trainer:
+ devices: 1
+ num_nodes: 1
+ accelerator: gpu
+ logger: False # logger provided by exp_manager
+ precision: 32 # 16, 32, or bf16
+
+tensor_model_parallel_size: 1
+pipeline_model_parallel_size: 1
+pipeline_model_parallel_split_rank: 0 # used for encoder and decoder model (0 for others)
+
+# model architecture
+model:
+ max_seq_len: 256 # define the max sequence length for attention mask
+ encoder_seq_length: 2048
+ max_position_embeddings: ${.encoder_seq_length}
+ num_layers: 24
+ hidden_size: 4096
+ nb_heads: 32
+ head_size: 128
+ vocab_size: 50304
diff --git a/demo/NeMo/install.sh b/demo/NeMo/install.sh
new file mode 100644
index 00000000..277f250a
--- /dev/null
+++ b/demo/NeMo/install.sh
@@ -0,0 +1,485 @@
+#!/bin/sh
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Sourcing messes up the directory detection with readlink.
+if [ ! "${0##*/}" = "install.sh" ]; then
+ echo "Please run this install script, don't source it." >&2
+ echo "Use -h for usage and help." >&2
+ return 1
+fi
+
+NEMO_DIR=$(dirname "$(readlink -f "$0")")
+DEMO_DIR=$(dirname "${NEMO_DIR}")
+SCRIPT_DIR=$(dirname "${DEMO_DIR}")/scripts
+
+DEPENDENCIES_DIR="temp"
+BUILD_SRCLIBS=1
+BUILD_NINJA=0
+ARG_JOBS=1
+ARG_HELP=0
+
+install_essential_tools() {
+ pip_not_found=$(pip --version 2>&1 | grep -o "not found")
+ if [ "$pip_not_found" != "" ]; then
+ echo " > Installing pip..."
+ apt-get update
+ apt-get install -y python3-dev
+ cd "${1}" || exit
+ if [ ! -f "get-pip.py" ]; then
+ apt-get install -y wget
+ wget https://bootstrap.pypa.io/get-pip.py
+ fi
+ python3 get-pip.py
+ cd ..
+ fi
+
+ git_not_found=$(git --version 2>&1 | grep -o "not found")
+ if [ "$git_not_found" != "" ]; then
+ echo " > Installing git..."
+ apt-get update
+ apt-get install -y git
+ fi
+}
+
+install_ninja() {
+ if [ ! -d "ninja" ]; then
+ git clone https://github.com/ninja-build/ninja.git
+ fi
+ cd ninja || exit
+ git checkout v1.11.1
+
+ if [ ! -x "./ninja" ]; then
+ CMD="python3 configure.py --bootstrap"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+ else
+ echo " > ninja already built!"
+ fi
+
+ PATH_WITH_NINJA="$(pwd):${PATH}"
+ # Path exported for the current program scope only.
+ export PATH="${PATH_WITH_NINJA}"
+ unset PATH_WITH_NINJA
+ cd ..
+}
+
+PACKAGE_NEEDS_REINSTALL=0
+
+check_if_managed_install() {
+ PACKAGE_NEEDS_REINSTALL=0
+ dist_path="${1}"
+ # https://packaging.python.org/en/latest/specifications/direct-url/
+ if [ ! -f "${dist_path}/direct_url.json" ]; then
+ PACKAGE_NEEDS_REINSTALL=1
+ return
+ fi
+ if [ "$(grep -c "${NEMO_DIR}" "${dist_path}/direct_url.json")" != "1" ]; then
+ PACKAGE_NEEDS_REINSTALL=1
+ fi
+}
+
+apex_install_logic() {
+ if [ ! -d "apex" ]; then
+ git clone https://github.com/NVIDIA/apex.git
+ fi
+
+ cd apex || exit
+ APEX_PATH="$(pwd)"
+ git config --global --add safe.directory "${APEX_PATH}"
+ unset APEX_PATH
+
+ git checkout 5b5d41034b506591a316c308c3d2cd14d5187e23
+ git apply "${NEMO_DIR}"/apex.patch # Bypass CUDA version check in apex
+
+ torchcppext=$(pip show torch | grep Location | cut -d' ' -f2)"/torch/utils/cpp_extension.py"
+ if [ ! -f "$torchcppext" ]; then
+ echo "Could not locate torch installation using pip"
+ exit 1
+ fi
+ sed -i 's/raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))/pass/' "$torchcppext" # Bypass CUDA version check in torch
+ unset torchcppext
+
+ CMD="MAX_JOBS=${ARG_JOBS} python3 setup.py bdist_wheel -v --cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+
+ python3 -m pip install "$(find './dist' -name '*.whl' | head -n1)"
+ cd ../
+}
+
+check_if_apex_needs_reinstall() {
+ apex_loc="$(pip show apex | grep '^Location' | awk '{print $2}')"
+ apex_dist_loc="$(find "${apex_loc}" -depth -maxdepth 1 -name 'apex*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${apex_dist_loc}"
+ apex_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+ echo "${apex_needs_reinstall}"
+
+ unset apex_dist_loc
+ unset apex_loc
+}
+
+install_apex() {
+ has_apex=$(pip list | grep "^apex " | grep "apex" -o | awk '{print $1}' | awk '{print length}')
+ apex_needs_reinstall=0
+
+ if [ "$has_apex" != "4" ]; then
+ apex_install_logic
+ else
+ check_if_apex_needs_reinstall
+ if [ "$apex_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling Apex per demo version..."
+ python3 -m pip uninstall -y apex
+ apex_install_logic
+ else
+ echo " > Apex already installed!"
+ fi
+ fi
+ unset apex_needs_reinstall
+ unset has_apex
+}
+
+megatron_install_logic() {
+ if [ ! -d "Megatron-LM" ]; then
+ git clone -b main https://github.com/NVIDIA/Megatron-LM.git
+ fi
+
+ cd Megatron-LM || exit
+ MEGATRON_PATH="$(pwd)"
+ git config --global --add safe.directory "${MEGATRON_PATH}"
+ unset MEGATRON_PATH
+
+ git checkout 992da75a1fd90989eb1a97be8d9ff3eca993aa83
+ CMD="python3 -m pip install ./"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+ cd ../
+}
+
+check_if_megatron_needs_reinstall() {
+ megatron_loc="$(pip show megatron-core | grep '^Location' | awk '{print $2}')"
+ megatron_dist_loc="$(find "${megatron_loc}" -depth -maxdepth 1 -name 'megatron*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${megatron_dist_loc}"
+ megatron_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+
+ unset megatron_dist_loc
+ unset megatron_loc
+}
+
+install_megatron() {
+ has_megatron=$(pip list | grep "^megatron-core " | grep "megatron-core" -o | awk '{print $1}' | awk '{print length}')
+ megatron_needs_reinstall=0
+
+ if [ "$has_megatron" != "13" ]; then
+ megatron_install_logic
+ else
+ check_if_megatron_needs_reinstall
+ if [ "$megatron_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling Megatron per demo version..."
+ python3 -m pip uninstall -y megatron-core
+ megatron_install_logic
+ else
+ echo " > Megatron already installed!"
+ fi
+ fi
+ unset megatron_needs_reinstall
+ unset has_megatron
+}
+
+flash_attention_install_logic() {
+ if [ ! -d "flash-attention" ]; then
+ git clone https://github.com/HazyResearch/flash-attention.git
+ fi
+
+ cd flash-attention || exit
+ FLASH_ATTENTION_PATH="$(pwd)"
+ git config --global --add safe.directory "${FLASH_ATTENTION_PATH}"
+ unset FLASH_ATTENTION_PATH
+
+ git checkout v1.0.6
+ CMD="MAX_JOBS=${ARG_JOBS} python3 setup.py bdist_wheel"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+ python3 -m pip install "$(find './dist' -name '*.whl' | head -n1)"
+ cd ..
+}
+
+check_if_flash_attention_needs_reinstall() {
+ flash_attn_loc="$(pip show flash-attn | grep '^Location' | awk '{print $2}')"
+ flash_attn_dist_loc="$(find "${flash_attn_loc}" -depth -maxdepth 1 -name 'flash_attn*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${flash_attn_dist_loc}"
+ flash_attn_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+
+ unset flash_attn_dist_loc
+ unset flash_attn_loc
+}
+
+install_flash_attention() {
+ has_flashattn=$(pip list | grep "^flash-attn " | grep "flash-attn" -o | awk '{print $1}' | awk '{print length}')
+ flash_attn_needs_reinstall=0
+
+ if [ "$has_flashattn" != "10" ]; then
+ flash_attention_install_logic
+ else
+ check_if_flash_attention_needs_reinstall
+ if [ "$flash_attn_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling flash_attn per demo version..."
+ python3 -m pip uninstall -y flash-attn
+ flash_attention_install_logic
+ else
+ echo " > flash-attention already installed!"
+ fi
+ fi
+
+ unset flash_attn_needs_reinstall
+ unset has_flashattn
+}
+
+transformer_engine_install_logic() {
+ if [ ! -d "TransformerEngine" ]; then
+ git clone https://github.com/NVIDIA/TransformerEngine.git
+ fi
+
+ cd TransformerEngine || exit
+ TRANSFORMER_ENGINE_PATH="$(pwd)"
+ git config --global --add safe.directory "${TRANSFORMER_ENGINE_PATH}"
+ unset TRANSFORMER_ENGINE_PATH
+
+ git checkout 804f120322a13cd5f21ea8268860607dcecd055c
+ git submodule update --recursive --init
+ CMD="MAKEFLAGS=-j${ARG_JOBS} MAX_JOBS=${ARG_JOBS} python3 setup.py bdist_wheel --framework=pytorch"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+ python3 -m pip install "$(find './dist' -name '*.whl' | head -n1)"
+ cd ..
+
+ # Check for common point of failure with TE.
+ has_te_loc=$(pip list | grep "^transformer-engine " | grep "transformer-engine" -o | awk '{print $1}' | awk '{print length}')
+ [ "$has_te_loc" != "18" ] && {
+ echo " > TransformerEngine install failed. Probable cause of failures:"
+ echo " - CUDNN location was not picked up. If your CUDNN include dir"
+ echo " is /path/to/cudnn/include and lib is /path/to/cudnn/lib, "
+ echo " Invoke the script as CUDNN_PATH=/path/to/cudnn sh install.sh ..."
+ exit 1
+ }
+ unset has_te_loc
+}
+
+check_if_transformer_engine_needs_reinstall() {
+ te_loc="$(pip show transformer-engine | grep '^Location' | awk '{print $2}')"
+ te_dist_loc="$(find "${te_loc}" -depth -maxdepth 1 -name 'transformer_engine*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${te_dist_loc}"
+ te_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+
+ unset te_dist_loc
+ unset te_loc
+}
+
+install_transformer_engine() {
+ has_te=$(pip list | grep "^transformer-engine " | grep "transformer-engine" -o | awk '{print $1}' | awk '{print length}')
+ te_needs_reinstall=0
+
+ if [ "$has_te" != "18" ]; then
+ transformer_engine_install_logic
+ else
+ check_if_transformer_engine_needs_reinstall
+ if [ "$te_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling TransformerEngine per demo version..."
+ python3 -m pip uninstall -y transformer-engine
+ transformer_engine_install_logic
+ else
+ echo " > TransformerEngine already installed!"
+ fi
+ fi
+
+ unset te_needs_reinstall
+ unset has_te
+
+ # Patch TE files.
+ sh "${NEMO_DIR}/patch_te.sh"
+}
+
+nemo_install_logic() {
+ if [ ! -d "NeMo" ]; then
+ git clone --branch main --single-branch https://github.com/NVIDIA/NeMo.git NeMo
+ fi
+
+ cd NeMo || exit
+ NeMo_PATH="$(pwd)"
+ git config --global --add safe.directory "${NeMo_PATH}"
+ unset NeMo_PATH
+
+ git checkout bf270794267e0240d8a8b2f2514c80c6929c76f1
+ bash reinstall.sh
+ cd ../
+}
+
+check_if_nemo_needs_reinstall() {
+ nemo_loc="$(pip show nemo-toolkit | grep '^Location' | awk '{print $2}')"
+ nemo_dist_loc="$(find "${nemo_loc}" -depth -maxdepth 1 -name 'nemo_toolkit*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${nemo_dist_loc}"
+ nemo_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+
+ unset nemo_dist_loc
+ unset nemo_loc
+}
+
+install_nemo() {
+ has_nemo=$(pip list | grep "^nemo-toolkit " | grep "nemo-toolkit" -o | awk '{print $1}' | awk '{print length}')
+ nemo_needs_reinstall=0
+
+ if [ "$has_nemo" != "12" ]; then
+ nemo_install_logic
+ else
+ check_if_nemo_needs_reinstall
+ if [ "$nemo_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling NeMo per demo version..."
+ python3 -m pip uninstall -y nemo-toolkit
+ nemo_install_logic
+ else
+ echo " > NeMo already installed!"
+ fi
+ fi
+}
+
+while [ "$#" -gt 0 ]; do
+ case $1 in
+ --deps)
+ DEPENDENCIES_DIR="$2"
+ shift
+ ;;
+ -j | --jobs)
+ ARG_JOBS="$2"
+ shift
+ ;;
+ --ninja) BUILD_NINJA=1 ;;
+ --skipsrc) BUILD_SRCLIBS=0 ;;
+ -h | --help) ARG_HELP=1 ;;
+ *)
+ echo "Unknown parameter passed: $1"
+ echo "For help type: $0 --help"
+ exit 1
+ ;;
+ esac
+ shift
+done
+
+if [ "$ARG_HELP" -eq "1" ]; then
+ echo "Usage: sh $0 [options]"
+ echo "All arguments are optional."
+ echo " --help or -h : Print this help menu."
+ echo " [--deps] {temp} : Path to download and build dependencies."
+ echo " [-j | --jobs] {1} : Number of jobs to use for building from source."
+ echo " [--ninja] : Flag to build ninja (if not present) to speed up installation."
+ # skipsrc is not documented to prevent users from invoking it directly.
+ exit
+fi
+
+DEPENDENCIES_DIR="${NEMO_DIR}/${DEPENDENCIES_DIR}"
+echo " > Using ${DEPENDENCIES_DIR}' to store dependencies."
+mkdir -p "${DEPENDENCIES_DIR}"
+install_essential_tools "${DEPENDENCIES_DIR}"
+
+echo " > Installing Requirements.txt..."
+pip install --upgrade pip
+pip install nvidia-pyindex || {
+ echo "Could not install nvidia-pyindex, stopping install"
+ exit 1
+}
+# One of the hidden dependencies requires Cython, but doesn't specify it.
+# https://github.com/VKCOM/YouTokenToMe/pull/108
+# WAR by installing Cython before requirements.
+pip install "Cython==0.29.36" || {
+ echo "Could not install Cython, stopping install"
+ exit 1
+}
+# PyYaml, Cython and pip don't play well together.
+# https://github.com/yaml/pyyaml/issues/601
+pip install "pyyaml==5.4.1" --no-build-isolation || {
+ echo "Could not install PyYaml, stopping install"
+ exit 1
+}
+# Install a specific version of opencc to WAR a GLIBC not found error.
+pip install "opencc==1.1.6" || {
+ echo "Could not install OpenCC, stopping install"
+ exit 1
+}
+pip install -r requirements.txt || {
+ echo "Could not install dependencies, stopping install"
+ exit 1
+}
+
+# Installation from source
+if [ "$BUILD_SRCLIBS" -eq "1" ]; then
+    ! command -v -- "ninja" >/dev/null 2>&1 && [ "$BUILD_NINJA" -eq "0" ] && echo " > Could not locate ninja, consider passing the --ninja flag to speed up dependency installation."
+fi
+
+cd "${DEPENDENCIES_DIR}" || exit
+if (! command -v -- "ninja" >/dev/null 2>&1) && [ "$BUILD_NINJA" -eq "1" ]; then
+ echo " > Building ninja..."
+ install_ninja
+fi
+
+if [ "$BUILD_SRCLIBS" -eq "1" ]; then
+ echo " > Installing Apex..."
+ install_apex
+fi
+
+echo " > Installing Megatron-LM..."
+install_megatron
+
+if [ "$BUILD_SRCLIBS" -eq "1" ]; then
+ echo " > Installing flash-attention..."
+ install_flash_attention
+fi
+
+if [ "$BUILD_SRCLIBS" -eq "1" ]; then
+ echo " > Installing TransformerEngine..."
+ install_transformer_engine
+fi
+
+echo " > Installing NeMo..."
+install_nemo
+
+if [ ! -f "${NEMO_DIR}/GPT3/convert_te_onnx_to_trt_onnx.py" ]; then
+ echo " > Copying opset19 conversion script..."
+ if [ ! -f "${SCRIPT_DIR}/convert_te_onnx_to_trt_onnx.py" ]; then
+ echo "Opset19 conversion script is not located at /scripts/convert_te_onnx_to_trt_onnx.py"
+        exit 1
+ fi
+ cp "${SCRIPT_DIR}/convert_te_onnx_to_trt_onnx.py" "${NEMO_DIR}/GPT3/convert_te_onnx_to_trt_onnx.py"
+fi
+
+cd ../
+
+unset ARG_HELP
+unset ARG_JOBS
+unset BUILD_NINJA
+unset DEPENDENCIES_DIR
+unset SCRIPT_DIR
+unset DEMO_DIR
+unset NEMO_DIR
diff --git a/demo/NeMo/interface.py b/demo/NeMo/interface.py
new file mode 100644
index 00000000..ec3dcbf7
--- /dev/null
+++ b/demo/NeMo/interface.py
@@ -0,0 +1,727 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from datetime import datetime
+import os
+import random
+import sys
+import time
+from typing import List, Union, Dict
+from copy import copy
+
+from cuda import cuda
+from tqdm import tqdm
+import numpy as np
+import torch
+
+from transformers import PretrainedConfig
+from omegaconf import OmegaConf, listconfig
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from GPT3.decoding import full_inference, generate, process_output
+from GPT3.GPT3ModelConfig import GPT3ModelTRTConfig
+from GPT3.lambada_dataset import Lambada
+from GPT3.nemo_utils import get_computeprob_response
+from GPT3.sequence_perplexity import SequencePerplexity
+
+sys.path.append('../HuggingFace') # Include HuggingFace
+from NNDF.general_utils import NNFolderWorkspace
+from NNDF.logger import G_LOGGER
+from NNDF.networks import (
+ Precision,
+ NetworkMetadata,
+ TimingProfile,
+ BenchmarkingResult,
+ NetworkResult,
+ NetworkCheckpointResult,
+)
+from NNDF.interface import NetworkCommand
+
+# Manually set by referring to examples/nlp/language_modeling/conf/megatron_gpt_config.yaml
+# If a field cannot be found, set to None.
+DEFAULT_CONFIG = {
+ "is_encoder_decoder": False,
+ "is_decoder": True,
+ "architectures": [ "GPT3NeMoModel" ],
+}
+
+GPT3CONFIG_MAPPINGS = {
+ "gpt-126m": PretrainedConfig.from_dict(dict({"_name_or_path": "gpt-126m",
+ "num_heads": 12,
+ "num_layers": 12,
+ "hidden_size": 768,
+ "max_position_embeddings": 2048,
+ "min_seq_len": 0,
+ }, **DEFAULT_CONFIG)),
+ "gpt-1.3b": PretrainedConfig.from_dict(dict({"_name_or_path": "gpt-1.3b",
+ "num_heads": 16,
+ "num_layers": 24,
+ "hidden_size": 2048,
+ "max_position_embeddings": 2048,
+ "min_seq_len": 0,
+ }, **DEFAULT_CONFIG)),
+ "gpt-5b": PretrainedConfig.from_dict(dict({"_name_or_path": "gpt-5b",
+ "num_heads": 32,
+ "num_layers": 24,
+ "hidden_size": 4096,
+ "max_position_embeddings": 2048,
+ "min_seq_len": 16,
+ }, **DEFAULT_CONFIG)),
+}
+
+def _hf_hub_metadata(variant: str, fp8: bool) -> Dict[str, str]:
+ repo_mappings = {
+ "gpt-1.3b": "nvidia/nemo-megatron-gpt-1.3B",
+ "gpt-5b": "nvidia/nemo-megatron-gpt-5B",
+ }
+
+ try:
+ repo_id = repo_mappings[variant]
+ except KeyError:
+ raise RuntimeError(
+ "Variant should be one of {}, got {}".format(
+ list(repo_mappings.keys()), variant
+ )
+ )
+
+ file_key = (variant, "fp8" if fp8 else "fp16")
+ file_mappings = {
+ ("gpt-1.3b", "fp8"): ("nemo_gpt1.3B_fp16.nemo", None),
+ ("gpt-1.3b", "fp16"): ("nemo_gpt1.3B_fp16.nemo", None),
+ ("gpt-5b", "fp8"): ("nemo_gpt5B_fp8_bf16_tp1.nemo", "fp8"),
+ ("gpt-5b", "fp16"): ("nemo_gpt5B_fp16_tp1.nemo", None),
+ }
+
+ try:
+ filename, branch = file_mappings[file_key]
+ except KeyError:
+ raise RuntimeError(
+ "Downloading nemo file for variant : {}, precision : {} from huggingface hub is unsupported. Consider passing a nemo-model or onnx-model from the command line.".format(
+ file_key[0], file_key[1]
+ )
+ )
+
+ return {"repo_id": repo_id, "filename": filename, "revision": branch}
+
+
+def download_model(dst_dir: str, cache_dir: str, *args, **kwargs) -> str:
+ from huggingface_hub import hf_hub_download
+
+ os.makedirs(dst_dir, exist_ok=True)
+ os.makedirs(cache_dir, exist_ok=True)
+
+ model_metadata = _hf_hub_metadata(*args, **kwargs)
+ return hf_hub_download(
+ local_dir=str(dst_dir),
+ local_dir_use_symlinks="auto",
+ cache_dir=cache_dir,
+ **model_metadata,
+ )
+
+
+def load_dataset(dataset_name, base_dir, tokens_to_generate, padding):
+ ds_map = {"Lambada": Lambada(base_dir, tokens_to_generate, padding)}
+ return ds_map[dataset_name]
+
+def get_accuracy_metric(cfg):
+ topN = [int(i.strip()) for i in cfg.top_n.split(",")]
+ m_map = {"Perplexity": SequencePerplexity(topN)}
+ return m_map[cfg.metric]
+
+def remove_padded_prompts(output, nb_paddings):
+ if nb_paddings == 0:
+ return output
+ result = {}
+ for k, v in output.items():
+ if v != None and (type(v) is list or type(v) is torch.Tensor):
+ v = v[:-nb_paddings]
+ result[k] = v
+ return result
+
+def get_random_input(tokenizer, batch_size, in_seq_len, out_seq_len):
+ vocab_size = tokenizer.tokenizer.vocab_size
+ return (torch.randint(0, vocab_size, (batch_size, in_seq_len + out_seq_len), dtype=torch.int64).cuda(),
+ (torch.ones(batch_size, dtype=torch.int64) * in_seq_len).cuda())
+
+class BaseModel(torch.nn.Module):
+ def __init__(self):
+ super(BaseModel, self).__init__()
+ self.model = None
+ def forward(self, x):
+ raise Exception("BaseModel forward method is not intended to be called.")
+
+class NeMoCommand(NetworkCommand):
+ def __init__(
+ self,
+ nemo_cfg,
+ config_class,
+ description,
+ **kwargs
+ ):
+ self.nemo_cfg = nemo_cfg
+ super().__init__(config_class, description, **kwargs)
+
+ def validate_and_set_precision(self, fp8, fp16, bf16, use_fp8_storage, quantize_bmms):
+ if fp8:
+ if fp16:
+ G_LOGGER.info("Use FP8-FP16 precision.")
+ if bf16:
+ G_LOGGER.info("Use FP8-BF16 precision.")
+ elif fp16:
+ G_LOGGER.info("Use pure FP16 precision.")
+ elif bf16:
+ G_LOGGER.info("Use pure BF16 precision.")
+ else:
+ fp16 = True
+ G_LOGGER.warn("Precision is not specified. Use pure FP16 precision by default.")
+
+ self.fp8, self.fp16, self.bf16 = fp8, fp16, bf16
+ self.nemo_cfg.trt_export_options.use_fp8 = fp8
+ self.nemo_cfg.trt_export_options.use_fp16 = fp16
+ self.nemo_cfg.trt_export_options.use_bf16 = bf16
+ self.nemo_cfg.onnx_export_options.use_fp8_storage = use_fp8_storage
+ self.nemo_cfg.onnx_export_options.quantize_bmms = quantize_bmms
+
+ if fp16:
+ self.nemo_cfg.trainer.precision = "16"
+ elif bf16:
+ self.nemo_cfg.trainer.precision = "bf16"
+ else:
+ self.nemo_cfg.trainer.precision = "32"
+
+ def update_hyperparams(self, model_config):
+ self.nemo_cfg.model.num_layers = model_config.num_layers
+ self.nemo_cfg.model.nb_heads = model_config.num_heads
+ self.nemo_cfg.model.head_size = model_config.hidden_size // model_config.num_heads
+ self.nemo_cfg.model.hidden_size = model_config.hidden_size
+ self.nemo_cfg.model.encoder_seq_length = model_config.max_position_embeddings
+ self.nemo_cfg.model.max_position_embeddings = model_config.max_position_embeddings
+
+ def setup_environment(
+ self,
+ variant: str,
+ working_dir: str = "temp",
+ batch_size: int = 1,
+ num_beams: int = 1,
+ use_cache: bool = True,
+ verbose: bool = False,
+ info: bool = False,
+ iterations: int = None,
+ warmup: int = None,
+ number: int = None,
+ duration: int = None,
+ percentile: int = None,
+ cleanup: bool = False,
+ action: str = None,
+ max_seq_len: int = None,
+ fp8: bool = True,
+ fp16: bool = False,
+ bf16: bool = False,
+ use_fp8_storage: bool = False,
+ quantize_bmms: bool = False,
+ input_seq_len: int = None,
+ output_seq_len: int = None,
+ nemo_model: str = None,
+ nemo_checkpoint: str = None,
+ nemo_hparams: str = None,
+ onnx_model: str = None,
+ **kwargs,
+ ) -> None:
+ """
+        Use arguments from the command line or specified by the user to set up the config for the model.
+ """
+ self.validate_and_set_precision(fp8, fp16, bf16, use_fp8_storage, quantize_bmms)
+
+ if not torch.cuda.is_available():
+ raise EnvironmentError("GPU is required for NeMo demo.")
+
+ # Initialize CUDA Driver API
+ err, = cuda.cuInit(0)
+ if err != cuda.CUresult.CUDA_SUCCESS:
+ raise RuntimeError("Cuda initialization failed with error: {}".format(err))
+
+ # See https://pytorch.org/docs/stable/_modules/torch.html#set_float32_matmul_precision
+ torch.set_float32_matmul_precision('medium')
+
+ if max_seq_len != None:
+ self.nemo_cfg.model.max_seq_len = max_seq_len
+
+ assert action != None, "Action must be specified"
+ if action == "accuracy":
+ self.nemo_cfg.mode = "accuracy"
+ self.nemo_cfg.inference.compute_logprob = True
+ self.nemo_cfg.inference.all_probs = True
+ self.nemo_cfg.inference.greedy = True
+ self.nemo_cfg.inference.add_BOS = False
+ self.nemo_cfg.inference.tokens_to_generate = 1
+ self.nemo_cfg.inference.min_tokens_to_generate = 0
+ self.nemo_cfg.inference.temperature = 1.0
+ self.nemo_cfg.inference.top_k = 0
+ self.nemo_cfg.inference.top_p = 0.9
+ self.nemo_cfg.inference.repetition_penalty = 1.0
+ elif action == "benchmark":
+ self.nemo_cfg.mode = "benchmark"
+ if input_seq_len != None:
+ self.nemo_cfg.benchmark.input_seq_len = input_seq_len
+ if output_seq_len != None:
+ self.nemo_cfg.benchmark.output_seq_len = output_seq_len
+ self.nemo_cfg.inference.tokens_to_generate = self.nemo_cfg.benchmark.output_seq_len
+ self.nemo_cfg.inference.min_tokens_to_generate = self.nemo_cfg.benchmark.output_seq_len
+
+ if self.nemo_cfg.model.max_seq_len < (self.nemo_cfg.benchmark.input_seq_len + self.nemo_cfg.benchmark.output_seq_len):
+ raise ValueError(f"Max sequence length of the model needs to be greater than or equal to the sum of input sequence length and output sequence length. Got {self.nemo_cfg.model.max_seq_len} < {self.nemo_cfg.benchmark.input_seq_len} + {self.nemo_cfg.benchmark.output_seq_len}.")
+
+ if (nemo_model or nemo_checkpoint) and onnx_model:
+ raise RuntimeError(
+ "Both nemo-model and onnx-model cannot be specified together. Please specify either nemo-model or onnx-model."
+ )
+
+ assert variant in GPT3CONFIG_MAPPINGS
+ model_config = GPT3CONFIG_MAPPINGS[variant]
+
+ if self.nemo_cfg.model.max_seq_len > model_config.max_position_embeddings:
+ G_LOGGER.warn(
+ f"Updating max_position_embeddings to be the same as max_seq_len {self.nemo_cfg.model.max_seq_len}."
+ )
+ G_LOGGER.warn(
+                f"Outputs longer than {model_config.max_position_embeddings} might not be meaningful."
+ )
+ model_config.max_position_embeddings = self.nemo_cfg.model.max_seq_len
+
+ if self.nemo_cfg.model.max_seq_len < model_config.min_seq_len:
+ G_LOGGER.warn(
+ f"Force updating max_seq_len to minimum required length {model_config.min_seq_len}."
+ )
+ self.nemo_cfg.model.max_seq_len = model_config.min_seq_len
+
+ self.nemo_cfg.batch_size = batch_size
+ self.nemo_cfg.use_cache = use_cache
+
+ if nemo_checkpoint != None:
+ # Set NeMo checkpoint configs
+ self.nemo_cfg.checkpoint_dir = os.path.dirname(nemo_checkpoint)
+ if not self.nemo_cfg.checkpoint_dir:
+ raise ValueError(f"NeMo checkpoint needs to be provided with full path.")
+ self.nemo_cfg.checkpoint_name = os.path.basename(nemo_checkpoint)
+ self.nemo_cfg.hparams_file = nemo_hparams
+ else:
+ if onnx_model != None:
+ G_LOGGER.info(f"Using onnx model {onnx_model} for inference.")
+ if os.path.exists(onnx_model):
+ self.nemo_cfg.onnx_model_file = onnx_model
+ else:
+ raise IOError(
+ f"Could not find the specified onnx file {onnx_model}."
+ )
+ else:
+ if nemo_model != None:
+ if os.path.exists(nemo_model):
+ self.nemo_cfg.gpt_model_file = nemo_model
+ else:
+ raise IOError(
+ f"Could not find the specified nemo file {nemo_model}."
+ )
+ else:
+ G_LOGGER.info("Downloading nemo model from HuggingFace Hub")
+ # Download nemo model if it does not exist.
+                    # Set up temporary metadata and config to create a workspace
+                    # for the downloaded artefacts.
+ download_metadata = NetworkMetadata(
+ variant=variant,
+ precision=Precision(fp16=self.fp16),
+ use_cache=use_cache,
+ num_beams=num_beams,
+ batch_size=batch_size
+ )
+
+ download_config = self.config_class(metadata=download_metadata)
+ download_config.from_nemo_config(copy(self.nemo_cfg))
+ download_workspace = NNFolderWorkspace(download_config, working_dir)
+
+ self.nemo_cfg.gpt_model_file = download_model(
+ dst_dir=download_workspace.dpath + "/artefacts",
+ cache_dir=download_workspace.dpath + "/cache",
+ variant=variant,
+ fp8=fp8,
+ )
+
+ if self.nemo_cfg.gpt_model_file == None and self.nemo_cfg.checkpoint_dir == None and onnx_model == None:
+ G_LOGGER.error("No model exists based on specified configs and precisions.")
+ raise ValueError("Model not found.")
+
+ self.update_hyperparams(model_config)
+
+ # HuggingFace code
+ if verbose:
+ G_LOGGER.setLevel(level=G_LOGGER.DEBUG)
+ elif info:
+ G_LOGGER.setLevel(level=G_LOGGER.INFO)
+
+ if variant is None:
+ G_LOGGER.error("You need to specify --variant to run NeMo demo")
+ return
+
+ if self._args is not None:
+ G_LOGGER.info("Setting up environment with arguments: {}".format(self._args))
+ else:
+ G_LOGGER.info("User-customized API is called")
+
+ self.metadata = NetworkMetadata(
+ variant=variant,
+ precision=Precision(fp16=self.fp16),
+ use_cache=use_cache,
+ num_beams=num_beams,
+ batch_size=batch_size
+ )
+
+ self.config = self.config_class(
+ metadata = self.metadata
+ )
+
+ self.config.from_nemo_config(self.nemo_cfg)
+
+ self.workspace = NNFolderWorkspace(
+ self.config, working_dir
+ )
+
+ self.timing_profile = TimingProfile(
+ iterations=iterations,
+ number=number,
+ warmup=warmup,
+ duration=duration,
+ percentile=percentile,
+ )
+
+ self.keep_torch_model = not cleanup
+ self.keep_onnx_model = not cleanup
+ self.keep_trt_engine = not cleanup
+
+ self.process_framework_specific_arguments(onnx_model=onnx_model, **kwargs)
+
+ def process_framework_specific_arguments(self, **kwargs):
+ pass
+
+ def run(self) -> Union[List[NetworkResult], BenchmarkingResult]:
+ """
+        Main entry point which compiles the model and generates results for command-line mode.
+        The general process for all commands is the same:
+        (1) Download the model
+        (2) Run either checkpoint or benchmark
+        (3) Return the result
+ """
+ t0 = time.time()
+ self.models = self.setup_tokenizer_and_model()
+ t1 = time.time()
+ G_LOGGER.info("setup_tokenizer_and_model() takes {:.4f}s in total.".format(t1 - t0))
+
+ results = []
+ ppl = None
+ random.seed(self.nemo_cfg.inference.seed)
+ np.random.seed(self.nemo_cfg.inference.seed)
+ torch.manual_seed(self.nemo_cfg.inference.seed)
+ if self.nemo_cfg.mode == "accuracy":
+ G_LOGGER.debug("Run in accuracy mode.")
+ eval_ppl = get_accuracy_metric(self.nemo_cfg.accuracy)
+ has_align_requirement = self.nemo_cfg.runtime == 'nemo' and hasattr(self.model.cfg, "fp8") and self.model.cfg.fp8 == True
+ if has_align_requirement and self.nemo_cfg.accuracy.tokens_to_generate > 1:
+ self.nemo_cfg.accuracy.tokens_to_generate = 1
+ G_LOGGER.warn("Force set tokens_to_generate=1 for FP8 run in NeMo framework.")
+ dataset = load_dataset(self.nemo_cfg.accuracy.dataset, self.workspace.rootdir, self.nemo_cfg.accuracy.tokens_to_generate, 8 if has_align_requirement else -1)
+ tokenizer = self.tokenizer
+
+ def eval_ppl_with_batch_input(eval_ppl, batch_input):
+ ds_input = dataset.preprocess_input(tokenizer, batch_input)
+ self.nemo_cfg.inference.tokens_to_generate = self.nemo_cfg.accuracy.tokens_to_generate
+ self.nemo_cfg.inference.min_tokens_to_generate = self.nemo_cfg.accuracy.tokens_to_generate
+
+ inputs = ds_input.inputs
+ response = full_inference(
+ model=self.model,
+ inputs=inputs,
+ cfg=self.nemo_cfg,
+ )
+
+                # It is still a prediction task even when tokens_to_generate > 1, so we need to restore the context length.
+ batch_size = ds_input.inputs[0].shape[0]
+ real_ctx_length = ds_input.inputs[0].shape[1] - 1
+ inputs = (ds_input.inputs[0], torch.ones(batch_size, dtype=torch.int32) * real_ctx_length)
+
+ response = get_computeprob_response(tokenizer, response, inputs)
+ eval_ppl.update(ds_input=ds_input, response=response, tokenizer=tokenizer)
+
+ batch_input = []
+ for doc in tqdm(dataset.load()):
+ batch_input.append(doc)
+
+ if len(batch_input) == self.nemo_cfg.batch_size:
+ eval_ppl_with_batch_input(eval_ppl, batch_input)
+ batch_input.clear()
+
+ if len(batch_input):
+ # Pad empty text to batch size
+ while (len(batch_input) % self.nemo_cfg.batch_size) != 0:
+ batch_input.append({"text": ""})
+ eval_ppl_with_batch_input(eval_ppl, batch_input)
+
+ ppl, sequence_ppl, _, acc_text = eval_ppl.compute()
+ print("***************************")
+ print("{} ppl(last token): {:.4f}, ppl(sequence): {:.4f}, {}".format(self.nemo_cfg.accuracy.dataset, ppl, sequence_ppl, acc_text))
+ print("***************************")
+ elif self.nemo_cfg.mode == "benchmark":
+ G_LOGGER.debug("Run in benchmark mode.")
+ rand_input = get_random_input(self.model.tokenizer, self.nemo_cfg.batch_size, self.nemo_cfg.benchmark.input_seq_len, self.nemo_cfg.benchmark.output_seq_len)
+
+ for _ in range(self.timing_profile.warmup):
+ output = full_inference(self.model, rand_input, self.nemo_cfg)
+
+ class BenchmarkTimer:
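+                """Accumulates per-iteration wall-clock times and reports latency and throughput statistics."""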
+ def __init__(self, name):
+ self.name = name
+ self.started = False
+ self.start_time = None
+ self.times = []
+
+ def start(self):
+ assert not self.started
+ self.started = True
+ self.start_time = time.perf_counter()
+
+ def end(self):
+ assert self.started
+ self.started = False
+ self.times.append(time.perf_counter() - self.start_time)
+
+ def stats_str(self, num_tokens):
+ total_time = sum(self.times)
+ avg_time = total_time / float(len(self.times))
+ self.times.sort()
+ percentile95 = self.times[int(len(self.times) * 0.95)]
+ percentile99 = self.times[int(len(self.times) * 0.99)]
+ throughput = float(num_tokens) / avg_time
+ return("[{:10s}] Total Time: {:0.5f} s, Average Time: {:0.5f} s, 95th Percentile Time: {:0.5f} s, 99th Percentile Time: {:0.5f} s, Throughput: {:0.2f} tokens/s".format(self.name, total_time, avg_time, percentile95, percentile99, throughput))
+
+ G_LOGGER.info("Warm up finished. Start benchmarking...")
+ e2e_timer = BenchmarkTimer("E2E inference")
+ core_timer = BenchmarkTimer("Without tokenizer")
+ start_time = datetime.now()
+ iter_idx = 0
+ cur_duration = 0
+ while iter_idx < self.timing_profile.iterations or cur_duration < self.timing_profile.duration:
+ core_timer.start()
+ e2e_timer.start()
+ output = generate(self.model, rand_input, self.nemo_cfg)
+ core_timer.end()
+
+ output = process_output(self.model, output)
+ e2e_timer.end()
+
+ iter_idx += 1
+ cur_duration = (datetime.now() - start_time).total_seconds()
+
+ num_tokens = self.nemo_cfg.batch_size * self.nemo_cfg.benchmark.output_seq_len
+ print("***************************")
+ print(f"Running {iter_idx} iterations with duration: {cur_duration}s, batch size: {self.nemo_cfg.batch_size}, input sequence length: {self.nemo_cfg.benchmark.input_seq_len} and output sequence length: {self.nemo_cfg.benchmark.output_seq_len}")
+ print(f"{e2e_timer.stats_str(num_tokens)}")
+ print(f"{core_timer.stats_str(num_tokens)}")
+ print("***************************")
+ else:
+ G_LOGGER.debug("Run in inference mode.")
+ assert self.nemo_cfg.mode == "inference"
+ if self.nemo_cfg.runtime == 'nemo' and hasattr(self.model.cfg, "fp8") and self.model.cfg.fp8 == True and self.nemo_cfg.batch_size % 8 != 0:
+ new_batch_size = ((self.nemo_cfg.batch_size + 7) // 8) * 8
+ print("Update batch size from {} to {} for NeMo FP8 inference.".format(self.nemo_cfg.batch_size, new_batch_size))
+ self.nemo_cfg.batch_size = new_batch_size
+
+ nb_paddings = 0
+ while (len(self.nemo_cfg.prompts) % self.nemo_cfg.batch_size) != 0:
+ self.nemo_cfg.prompts.append(self.nemo_cfg.prompts[-1])
+ nb_paddings += 1
+
+ batch_idx = 0
+ start = 0
+ while True:
+ inputs = OmegaConf.to_container(listconfig.ListConfig(self.nemo_cfg.prompts[start:start+self.nemo_cfg.batch_size]))
+ output = full_inference(self.model, inputs, self.nemo_cfg)
+ output = remove_padded_prompts(output, nb_paddings)
+ print("***************************")
+ print("Batch {}: {}".format(batch_idx, output))
+ print("***************************")
+ batch_idx += 1
+ start += self.nemo_cfg.batch_size
+ if start >= len(self.nemo_cfg.prompts):
+ break
+
+ t2 = time.time()
+        G_LOGGER.info("Inference session takes {:.4f}s in total.".format(t2 - t1))
+
+ # Release runtime objects
+ if self.nemo_cfg.runtime == 'onnx':
+ del self.model.onnxrt
+ elif self.nemo_cfg.runtime == 'trt':
+ del self.model.trt
+
+ return results, ppl
+
+ def add_args(self) -> None:
+ general_group = self._parser.add_argument_group("general")
+ general_group.add_argument(
+ "--help",
+ "-h",
+ help="Shows help message for NeMo commands.",
+ action="store_true",
+ )
+ general_group.add_argument(
+ "--verbose", "-v",
+ help="Display verbose logs.",
+ action="store_true"
+ )
+ general_group.add_argument(
+ "--info", help="Display info logs.", action="store_true"
+ )
+ general_group.add_argument(
+ "--working-dir", "-wd",
+        help="Location where the model and other downloaded files are saved.",
+ required=True,
+ )
+
+ timing_group = self._parser.add_argument_group("inference measurement")
+ timing_group.add_argument(
+ "--duration",
+ type=int,
+        help="Minimum duration of inference iterations to measure, in seconds.",
+ default=NetworkCommand.DEFAULT_DURATION,
+ )
+ timing_group.add_argument(
+ "--iterations",
+ type=int,
+ help="Number of iterations to measure.",
+ default=NetworkCommand.DEFAULT_ITERATIONS,
+ )
+ timing_group.add_argument(
+ "--warmup",
+ type=int,
+ help="Number of warmup iterations before actual measurement occurs.",
+ default=NetworkCommand.DEFAULT_WARMUP,
+ )
+
+ model_config_group = self._parser.add_argument_group("model")
+ model_config_group.add_argument(
+ "--nemo-model",
+ help="Set a NeMo model to be used.",
+ type=str,
+ default=None
+ )
+ model_config_group.add_argument(
+ "--nemo-checkpoint",
+ help="Set a NeMo checkpoint to be used.",
+ type=str,
+ default=None
+ )
+ model_config_group.add_argument(
+ "--nemo-hparams",
+ help="Set a NeMo hparams.yaml to be used.",
+ type=str,
+ default=None
+ )
+ model_config_group.add_argument(
+ "--onnx-model",
+        help="Set an ONNX model (exported from a NeMo model) to be used. See `export_utils.py` in the model directory for exporting ONNX files.",
+ type=str,
+ default=None,
+ )
+ model_config_group.add_argument(
+ "--max-seq-len",
+        help="Set the maximum sequence length used for a GPT model.",
+ type=int,
+ default=None,
+ )
+ model_config_group.add_argument(
+ "--batch-size", "-b",
+ help="Set batch size for inference",
+ required=False,
+ type=int,
+ default=1
+ )
+ model_config_group.add_argument(
+ "--variant", "-m",
+        help="Model variant to run.",
+ required=True,
+ choices=GPT3ModelTRTConfig.TARGET_MODELS,
+ )
+ model_config_group.add_argument(
+ "--use-cache",
+ "-kv",
+ help="Enable KV cache",
+ action="store_true",
+ default=False,
+ )
+ model_config_group.add_argument(
+ "--fp8",
+ action="store_true",
+ help="Use FP8 precision.",
+ default=False
+ )
+ model_config_group.add_argument(
+ "--fp16",
+ action="store_true",
+ help="Use FP16 precision.",
+ default=False
+ )
+ model_config_group.add_argument(
+ "--bf16",
+ action="store_true",
+ help="Use BF16 precision.",
+ default=False
+ )
+ model_config_group.add_argument(
+ "--use-fp8-storage",
+ action="store_true",
+ help="Use FP8 storage precision.",
+ default=False
+ )
+ model_config_group.add_argument(
+ "--quantize-bmms",
+ help="Quantize attention BMMs",
+ action="store_true",
+ default=False,
+ )
+
+ def __call__(self):
+ t0 = time.time()
+ self.add_args()
+ self._args = self._parser.parse_args()
+ if "help" in self._args and self._args.help == True:
+ self._parser.print_help()
+ exit(0)
+
+ self.setup_environment(
+ **vars(self._args),
+ )
+ t1 = time.time()
+        G_LOGGER.info("Setting up the environment takes {:.4f}s.".format(t1 - t0))
+
+ network_results, ppl_results = self.run()
+ return NetworkCheckpointResult(
+ network_results=network_results,
+ accuracy=0,
+ perplexity=0,
+ )
diff --git a/demo/NeMo/nemo_export.py b/demo/NeMo/nemo_export.py
new file mode 100644
index 00000000..b9f5ad3a
--- /dev/null
+++ b/demo/NeMo/nemo_export.py
@@ -0,0 +1,922 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import argparse
+import subprocess as sp
+import shlex
+import omegaconf
+import os
+import sys
+import warnings
+from typing import Dict, List, Optional, Tuple
+import numpy as np
+
+# nemo
+from nemo.core import ModelPT
+from nemo.core.classes import Exportable
+from nemo.core.neural_types import ChannelType, NeuralType
+from nemo.utils.export_utils import augment_filename
+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel, MegatronGPTExportableModel
+
+# onnx
+import onnx
+import onnx_graphsurgeon as gs
+
+# polygraphy
+from polygraphy.backend.trt import Profile, CreateConfig, engine_from_network, NetworkFromOnnxPath, save_engine
+from polygraphy.logger import G_LOGGER as PG_LOGGER
+
+import torch
+import transformer_engine
+
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir, "HuggingFace")
+ sys.path.append(project_root)
+
+# Add syspath for custom library
+from GPT3.nemo_utils import load_nemo_model, release_nemo_model
+from GPT3.convert_te_onnx_to_trt_onnx import replace_customop_qdq_with_onnx_qdq
+
+# HuggingFace utils
+from NNDF.logger import G_LOGGER
+from NNDF.models import _calculate_polygraphy_verbosity
+
+# ONNX conversion script
+
+# Set polygraphy logging level here.
+PG_LOGGER.module_severity = PG_LOGGER.INFO
+
+class MegatronGPTSingleInputExportableModel(MegatronGPTExportableModel):
+ """
+ Wrapper for MegatronGPTExportableModel to export ONNX with a single input
+ """
+
+ def __init__(self, model, max_seq_len):
+ super().__init__(model)
+ self.cfg = model.cfg
+ self.max_seq_len = max_seq_len
+
+ def forward(self, tokens):
+ def model_forward(tokens):
+ position_ids, attention_mask = self.get_position_ids_and_mask(tokens, self.max_seq_len)
+ assert tokens.shape == position_ids.shape
+ assert attention_mask.shape[2] == attention_mask.shape[3] == tokens.shape[1] == position_ids.shape[1]
+ return self.model.forward(
+ tokens=tokens.cuda(),
+ text_position_ids=position_ids.cuda(),
+ attention_mask=attention_mask.cuda(),
+ labels=None,
+ )
+
+ with torch.no_grad(), torch.inference_mode(), torch.autocast(
+ 'cuda', dtype=self.dtype
+ ), warnings.catch_warnings():
+ warnings.filterwarnings(action='ignore', category=torch.jit.TracerWarning, module=r'.*')
+ if self.fp8_enabled:
+ with transformer_engine.pytorch.onnx_export(self.fp8_enabled), transformer_engine.pytorch.fp8_autocast(
+ enabled=self.fp8_enabled, fp8_recipe=self.fp8_recipe
+ ):
+ output_tensor = model_forward(tokens)
+ else:
+ output_tensor = model_forward(tokens)
+ return output_tensor
+
+ def get_position_ids_and_mask(self, data, max_seq_len):
+ seq_len = data.size()[1]
+ # Attention mask (lower triangular).
+ attention_mask = torch.tril(torch.ones(
+ (1, max_seq_len, max_seq_len), device=data.device)).view(
+ 1, 1, max_seq_len, max_seq_len)
+
+ # Position ids.
+ position_ids = torch.arange(max_seq_len, dtype=torch.long,
+ device=data.device)
+ position_ids = position_ids[:seq_len].unsqueeze(0).expand_as(data)
+
+ # Convert attention mask to binary:
+ attention_mask = (attention_mask < 0.5)
+
+ return position_ids, attention_mask[:1, :1, :seq_len, :seq_len]
+
+ def input_example(self):
+ ids = self.model.tokenizer.text_to_ids("how is the weather on Sunday morning?")
+ id_tensors = torch.unsqueeze(torch.LongTensor(ids), dim=0)
+ G_LOGGER.debug(f"Calling input_example shape {id_tensors.shape}")
+ return id_tensors, # return a tuple
+
+ @property
+ def input_types(self) -> Optional[Dict[str, NeuralType]]:
+ return {
+ "input_ids": NeuralType(('B', 'T'), ChannelType()),
+ }
+
+ @property
+ def input_names(self) -> List[str]:
+ return ['input_ids']
+
+def get_trtexec_cmd(onnx_fpath, cfg, bs):
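+    """
+    Build an equivalent trtexec command line (used only for debug logging) that mirrors the
+    optimization profiles and precision flags passed to the Polygraphy engine builder.
+    """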
+ max_seq_len = cfg.model.max_seq_len
+ opt_seq_len = cfg.trt_export_options.opt_seq_len if cfg.trt_export_options.opt_seq_len else (max_seq_len // 2)
+ trtexec_cmd = f"trtexec --onnx={onnx_fpath}"
+ min_shapes = f"--minShapes=input_ids:{bs}x1"
+ opt_shapes = f"--optShapes=input_ids:{bs}x{opt_seq_len}"
+ max_shapes = f"--maxShapes=input_ids:{bs}x{max_seq_len}"
+ if not cfg.use_one_input:
+ min_shapes += f",position_ids:{bs}x1"
+ opt_shapes += f",position_ids:{bs}x{opt_seq_len}"
+ max_shapes += f",position_ids:{bs}x{max_seq_len}"
+ if not cfg.trt_export_options.use_fp8:
+ min_shapes += ",attention_mask:1x1x1x1"
+ opt_shapes += f",attention_mask:1x1x{opt_seq_len}x{opt_seq_len}"
+ max_shapes += f",attention_mask:1x1x{max_seq_len}x{max_seq_len}"
+
+ if cfg.use_cache:
+ trtexec_cmd += " --profile=0"
+ nbheads, headsize = cfg.model.nb_heads, cfg.model.head_size
+ input_k = get_past_key_name('*')
+ input_v = get_past_value_name('*')
+ # ("sequence", "batch", nbheads, headsize)
+ min_shapes += f",{input_k}:0x{bs}x{nbheads}x{headsize},{input_v}:0x{bs}x{nbheads}x{headsize}"
+ opt_shapes += f",{input_k}:0x{bs}x{nbheads}x{headsize},{input_v}:0x{bs}x{nbheads}x{headsize}"
+ max_shapes += f",{input_k}:0x{bs}x{nbheads}x{headsize},{input_v}:0x{bs}x{nbheads}x{headsize}"
+ trtexec_cmd += f" {min_shapes} {opt_shapes} {max_shapes}"
+
+ if cfg.use_cache:
+ trtexec_cmd += " --profile=1"
+
+ min_shapes = f"--minShapes=input_ids:{bs}x1"
+ opt_shapes = f"--optShapes=input_ids:{bs}x1"
+ max_shapes = f"--maxShapes=input_ids:{bs}x1"
+ if not cfg.use_one_input:
+ min_shapes += f",position_ids:{bs}x1"
+ opt_shapes += f",position_ids:{bs}x1"
+ max_shapes += f",position_ids:{bs}x1"
+ if not cfg.trt_export_options.use_fp8:
+ min_shapes += ",attention_mask:1x1x1x1"
+ opt_shapes += f",attention_mask:1x1x{opt_seq_len}x{opt_seq_len}"
+ max_shapes += f",attention_mask:1x1x{max_seq_len}x{max_seq_len}"
+
+ nbheads, headsize = cfg.model.nb_heads, cfg.model.head_size
+ input_k = get_past_key_name('*')
+ input_v = get_past_value_name('*')
+ # ("sequence", "batch", nbheads, headsize)
+ min_shapes += f",{input_k}:1x{bs}x{nbheads}x{headsize},{input_v}:1x{bs}x{nbheads}x{headsize}"
+ opt_shapes += f",{input_k}:{opt_seq_len}x{bs}x{nbheads}x{headsize},{input_v}:{opt_seq_len}x{bs}x{nbheads}x{headsize}"
+ max_shapes += f",{input_k}:{max_seq_len - 1}x{bs}x{nbheads}x{headsize},{input_v}:{max_seq_len - 1}x{bs}x{nbheads}x{headsize}"
+ trtexec_cmd += f" {min_shapes} {opt_shapes} {max_shapes}"
+
+ use_tf32 = cfg.trt_export_options.use_tf32
+ use_fp8 = cfg.trt_export_options.use_fp8
+ use_fp16 = cfg.trt_export_options.use_fp16
+ use_bf16 = cfg.trt_export_options.use_bf16
+ use_strongly_typed = cfg.trt_export_options.use_strongly_typed
+ sparse = cfg.trt_export_options.sparse
+ trtexec_cmd += " --noTF32" if not use_tf32 else ""
+ trtexec_cmd += " --fp8" if (use_fp8 and not use_strongly_typed) else ""
+ trtexec_cmd += " --fp16" if (use_fp16 and not use_strongly_typed) else ""
+ trtexec_cmd += " --bf16" if (use_bf16 and not use_strongly_typed) else ""
+ trtexec_cmd += " --stronglyTyped" if use_strongly_typed else ""
+ trtexec_cmd += " --sparsity=enable" if sparse else ""
+ trtexec_cmd += " --timingCacheFile=functional.cache"
+ return trtexec_cmd
+
+
+def add_zero_point(g, base_name, dtype):
+ """Add Q/DQ zero-point constant"""
+ _zp_fp8_value = onnx.helper.make_tensor(base_name + "_zp_fp8_value", dtype, (1,), [0.0])
+ zero_point_fp8 = gs.Variable(base_name + "_zero_point", dtype=dtype, shape=(1,))
+ zero_point_const = gs.Node(op="Constant", name= base_name + "_zero_point_const", inputs=[], outputs=[zero_point_fp8], attrs={"value": _zp_fp8_value})
+ g.nodes.append(zero_point_const)
+ return zero_point_fp8
+
+
+def add_scale(g, base_name, dtype, value):
+ """Add Q/DQ scale constant"""
+ _scale_value = onnx.helper.make_tensor(base_name + "_scale_value", dtype, (1,), [value])
+ scale = gs.Variable(base_name + "_scale", dtype=dtype, shape=(1,))
+ scale_const = gs.Node(op="Constant", name=base_name + "_scale_const", inputs=[], outputs=[scale], attrs={"value": _scale_value})
+ g.nodes.append(scale_const)
+ return scale
+
+
+def add_cast(g, inp, outp_dtype, cast_name):
+ """Add Cast operator """
+ cast_outp = gs.Variable(cast_name+"_out", dtype=outp_dtype)
+ new_cast = gs.Node(
+ op="Cast",
+ name=cast_name,
+ inputs=[inp],
+ outputs=[cast_outp],
+ attrs={"to": outp_dtype}
+ )
+ g.nodes.append(new_cast)
+ return cast_outp
+
+
+def add_q(g, inp, hp_dtype, q_dtype, q_name=None):
+ """Add QuantizeLinear operator"""
+ scale_dtype = hp_dtype
+ q_name = q_name or f"{inp.name}_qfp8"
+ q_out = gs.Variable(q_name, dtype=q_dtype)
+ q = gs.Node(op="QuantizeLinear", name=q_name,
+ inputs=[
+ inp,
+ add_scale(g, inp.name, scale_dtype, 1.0),
+ add_zero_point(g, inp.name, q_dtype)
+ ],
+ outputs=[q_out])
+ g.nodes.append(q)
+ return q_out
+
+
+def add_dq(g, inp, hp_dtype, dq_dtype):
+ """Add DequantizeLinear operator"""
+ dq_name = f"{inp.name}_dqfp8"
+ scale_dtype = hp_dtype
+ dq_out = gs.Variable(dq_name, dtype=hp_dtype)
+ dq = gs.Node(op="DequantizeLinear", name=dq_name,
+ inputs=[
+ inp,
+ add_scale(g, inp.name, scale_dtype, 1.0),
+ add_zero_point(g, inp.name, dq_dtype)],
+ outputs=[dq_out])
+ g.nodes.append(dq)
+ return dq_out
+
+
+def quantize_all_bmms(g, dtype_high_prec, use_fp8_storage):
+ """Quantize the inputs of all batched matmul operators"""
+
+ def quantize_bmm(g, bmm, dtype_high_prec):
+ assert len(bmm.inputs) == 2
+ dq_outputs = []
+ for i in range(len(bmm.inputs)):
+ if i == 0 or not use_fp8_storage:
+ q_outp = add_q(g, bmm.inputs[i], dtype_high_prec, onnx.TensorProto.FLOAT8E4M3FN)
+ dq_out = add_dq(g, q_outp, dtype_high_prec, onnx.TensorProto.FLOAT8E4M3FN)
+ else:
+                # bmm.inputs[1] is the input from K or V, which we don't quantize if it is stored
+                # in the cache in a quantized type.
+ dq_out = add_dq(g, bmm.inputs[i], dtype_high_prec, onnx.TensorProto.FLOAT8E4M3FN)
+ dq_outputs.append(dq_out)
+ bmm.inputs = dq_outputs
+
+ bmm_nodes = [node for node in g.nodes if node.op == "MatMul"]
+ G_LOGGER.info("Quantizing attention BMMs")
+ G_LOGGER.info(f"Found {len(bmm_nodes)} MatMul operator nodes")
+ for bmm in bmm_nodes:
+        # Do not quantize the MatMul at the head of GPT3.
+ if bmm.name == "/model/module/MatMul":
+ continue
+ quantize_bmm(g, bmm, dtype_high_prec)
+
+
+# Use ONNX graphsurgeon to add KV-cache to ONNX file
+# Reusing the HF demo names.
+def get_past_key_name(layer_id):
+ past_key_name = f"past_key_values.{layer_id}.decoder.key"
+ return past_key_name
+
+def get_past_value_name(layer_id):
+ past_value_name = f"past_key_values.{layer_id}.decoder.value"
+ return past_value_name
+
+def get_past_shape(nbheads, headsize):
+ return ("sequence_past_decoder_length", "batch", nbheads, headsize)
+
+def get_present_key_name(layer_id: int):
+ present_key_name = f"present_key_values.{layer_id}.decoder.key"
+ return present_key_name
+
+def get_present_value_name(layer_id: int):
+ present_value_name = f"present_key_values.{layer_id}.decoder.value"
+ return present_value_name
+
+def get_present_shape(nbheads, headsize):
+ return ("sequence_present_decoder_length", "batch", nbheads, headsize)
+
+def get_new_key_name(layer_id: int):
+ new_key_name = f"new_key_values.{layer_id}.decoder.key"
+ return new_key_name
+
+def get_new_value_name(layer_id: int):
+ new_value_name = f"new_key_values.{layer_id}.decoder.value"
+ return new_value_name
+
+def get_new_shape(nbheads, headsize):
+ return ("sequence", "batch", nbheads, headsize)
+
+def quantize_new_k_v(g, key_new, value_new, hp_dtype):
+ key_new_q_outp = add_q(g, key_new, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ key_new_dq_out = add_dq(g, key_new_q_outp, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ value_new_q_outp = add_q(g, value_new, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ value_new_dq_out = add_dq(g, value_new_q_outp, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ return key_new_dq_out, value_new_dq_out
+
+def add_kvcache_for(
+ g, layer_id, qkv_split, nbheads, headsize, dtype, kv_output_policy, hp_dtype, use_fp8_storage, quantize_bmms):
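+    """
+    Rewire a single transformer layer's QKV split so that the new K/V tensors are concatenated with
+    past K/V graph inputs, and expose the cache tensors as graph outputs according to `kv_output_policy`.
+    """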
+ _, key_new, value_new = qkv_split.outputs
+ key_consumers = [c for c in key_new.outputs]
+ value_consumers = [c for c in value_new.outputs]
+
+ def add_graph_past_inputs(use_fp8_storage):
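+        # Register past K/V tensors as graph inputs; when they are stored in FP8 and the BMMs are not
+        # quantized, insert DQ nodes so downstream consumers see the high-precision dtype.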
+ past_key = gs.Variable(
+ name=get_past_key_name(layer_id),
+ dtype=dtype,
+ shape=get_past_shape(nbheads, headsize))
+ past_value = gs.Variable(
+ name=get_past_value_name(layer_id),
+ dtype=dtype,
+ shape=get_past_shape(nbheads, headsize))
+ g.inputs.append(past_key)
+ g.inputs.append(past_value)
+
+ if use_fp8_storage and not quantize_bmms:
+ past_key_dq = add_dq(g, past_key, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ past_value_dq = add_dq(g, past_value, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ return past_key_dq, past_value_dq
+
+ return past_key, past_value
+
+ def add_concat(concat_name, input0, input1, output_name):
+ concat_out = gs.Variable(
+ output_name,
+ dtype=dtype,
+ shape=get_present_shape(nbheads, headsize))
+
+ concat = gs.Node(op="Concat", name=concat_name,
+ inputs=[input0, input1], outputs=[concat_out],
+ attrs={"axis": 0})
+ g.nodes.append(concat)
+ return concat_out
+
+ def add_cache_outputs(kv_output_policy, use_fp8_storage, hp_dtype):
+ if kv_output_policy == "kv_cache_concat":
+ new_key_output, new_value_output = key_concat_out, value_concat_out
+ elif kv_output_policy == "kv_new":
+ key_new.dtype = dtype
+ key_new.shape = get_new_shape(nbheads, headsize)
+ key_new.name = get_new_key_name(layer_id)
+ value_new.dtype = dtype
+ value_new.shape = get_new_shape(nbheads, headsize)
+ value_new.name = get_new_value_name(layer_id)
+
+ if use_fp8_storage:
+ key_new_q = add_q(g, key_new, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN,
+ f"{key_new.name}_qfp8")
+ value_new_q = add_q(g, value_new, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN,
+ f"{value_new.name}_qfp8")
+ new_key_output, new_value_output = key_new_q, value_new_q
+ else:
+ new_key_output, new_value_output = key_new, value_new
+ else:
+ raise ValueError(f"Unsupported kv_output_policy: {kv_output_policy}")
+ g.outputs.append(new_key_output)
+ g.outputs.append(new_value_output)
+ return new_key_output, new_value_output
+
+ past_key, past_value = add_graph_past_inputs(use_fp8_storage)
+ new_key_output, new_value_output = add_cache_outputs(kv_output_policy, use_fp8_storage, hp_dtype)
+
+ if quantize_bmms:
+ if use_fp8_storage:
+ key_new = new_key_output
+ value_new = new_value_output
+ else:
+ key_new, value_new = quantize_new_k_v(g, key_new, value_new, hp_dtype)
+ key_concat_out = add_concat(f"key.{layer_id}.concat",
+ past_key, key_new, get_present_key_name(layer_id))
+ value_concat_out = add_concat(f"value.{layer_id}.concat",
+ past_value, value_new, get_present_value_name(layer_id))
+
+ for c in key_consumers:
+ c.inputs[0] = key_concat_out
+ for c in value_consumers:
+ c.inputs[0] = value_concat_out
+
+
+def add_kvcache(g, nbheads, headsize, dtype, kv_output_policy, hp_dtype, use_fp8_storage, quantize_bmms):
+ """Add KV-cache to each Transformer layer's QKV split """
+ G_LOGGER.info("Adding KV-cache")
+ qkv_split_nodes = [node for node in g.nodes if node.op == "Split"]
+ G_LOGGER.debug(f"Found {len(qkv_split_nodes)} QKV-split nodes")
+
+ for layer_id, qkv_split in enumerate(qkv_split_nodes):
+ add_kvcache_for(
+ g, layer_id, qkv_split, nbheads, headsize, dtype, kv_output_policy, hp_dtype, use_fp8_storage, quantize_bmms)
+
+ G_LOGGER.debug("Done adding cache operations")
+ return len(qkv_split_nodes)
+
+
+def normalize_dyn_axes_to_hf_names(g, vocab_size):
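+    """Rename graph inputs/outputs and their dynamic axes to the names used by the HuggingFace demo."""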
+ g.inputs[0].name = "input_ids"
+ g.inputs[0].shape = ("batch", "sequence")
+ if len(g.inputs) > 1:
+ g.inputs[1].name = "position_ids"
+ g.inputs[1].shape = ("batch", "sequence")
+ g.outputs[0].name = "logits"
+ g.outputs[0].shape = ("batch", "sequence", vocab_size)
+ G_LOGGER.debug("Done normalizing dynamic axes names to HuggingFace demo names")
+
+
+def process_onnx(
+ kv_output_policy,
+ onnx_input_fpath,
+ onnx_output_fpath,
+ separate_param_files,
+ use_cache,
+ quantize_bmms,
+ nbheads, headsize, vocab_size, dtype, hp_dtype, use_fp8_storage):
+ """
+    Process an ONNX model: add KV-cache inputs and outputs, and save the resulting model to the specified path.
+ """
+ G_LOGGER.info(f"Importing {onnx_input_fpath}... this will take some time")
+ g = gs.import_onnx(onnx.load(onnx_input_fpath))
+ normalize_dyn_axes_to_hf_names(g, vocab_size)
+ num_layers = 0
+ if use_cache:
+ num_layers = add_kvcache(g, nbheads, headsize, dtype, kv_output_policy, hp_dtype, use_fp8_storage, quantize_bmms)
+ g.cleanup().toposort()
+
+ if quantize_bmms:
+ quantize_all_bmms(g, hp_dtype, use_fp8_storage)
+ g.cleanup().toposort()
+
+ G_LOGGER.info(f"Exporting {onnx_output_fpath}")
+ model = gs.export_onnx(g)
+ G_LOGGER.info(f"Saving {onnx_output_fpath}")
+ if separate_param_files:
+ onnx.save_model(model, onnx_output_fpath, save_as_external_data=True,
+ all_tensors_to_one_file = False, convert_attribute=False)
+ else:
+ onnx.save_model(model, onnx_output_fpath, save_as_external_data=False)
+ G_LOGGER.info(f"Done: {onnx_output_fpath}")
+ return num_layers
+
+
+def create_dir_if_not_exist(path):
+ dir = os.path.dirname(path)
+ if not os.path.exists(dir) and dir != "":
+ G_LOGGER.info(f"Making directory {dir}")
+ os.makedirs(dir)
+
+
+class NeMoConverter():
+ """
+ A class to convert a NeMo model to an ONNX file, and convert an ONNX file to a TensorRT engine.
+ """
+ def __init__(self, cfg, model_type=ModelPT):
+ self.model_type = model_type
+ self.cfg = cfg
+ self.model = None
+ self.export_envvars()
+
+ def export_envvars(self) -> None:
+ if self.cfg.trt_export_options.use_fp8:
+ G_LOGGER.info(
+ f"Setting max sequence length to {self.cfg.model.max_seq_len}"
+ )
+ os.environ["NVTE_ONNX_KVCACHE_MAX_SEQ_LEN"] = str(
+ self.cfg.model.max_seq_len
+ )
+
+ def nemo_to_onnx(self) -> str:
+ """
+ Convert a NeMo model to an ONNX model, return the file path to the ONNX model.
+ """
+ if self.model == None:
+ self.model = load_nemo_model(self.cfg, self.model_type)
+
+ if not isinstance(self.model, Exportable):
+ G_LOGGER.error("Your NeMo model class ({}) is not Exportable.".format(self.model.__class__.__name__))
+ sys.exit(1)
+
+ if hasattr(self.model.cfg, "fp8") and self.model.cfg.fp8 == True:
+ if self.cfg.trt_export_options.use_fp8 == False:
+ G_LOGGER.info("Turning on trt_export_options.use_fp8 because NeMo model is in FP8 precision.")
+ self.cfg.trt_export_options.use_fp8 = True
+ else:
+ if self.cfg.trt_export_options.use_fp8 == True:
+ G_LOGGER.info("Turning off trt_export_options.use_fp8 because NeMo model is not in FP8 precision.")
+ self.cfg.trt_export_options.use_fp8 = False
+
+ onnx_out = self.cfg.onnx_model_file
+ create_dir_if_not_exist(onnx_out)
+ check_trace = self.cfg.onnx_export_options.runtime_check
+ onnx_names = []
+
+ dynamic_axes={
+ 'input_ids': {0: "batch", 1: "sequence"},
+ 'position_ids': {0: "batch", 1: "sequence"},
+ 'logits': {0: "batch", 1: "sequence"},
+ }
+
+ if self.cfg.use_one_input:
+ # Use a wrapper class to get rid of inputs other than input_ids.
+ self.model = MegatronGPTSingleInputExportableModel(self.model, self.cfg.model.max_seq_len)
+ del dynamic_axes['position_ids']
+
+ try:
+ self.model.to(device=self.cfg.onnx_export_options.device).freeze()
+ self.model.eval()
+ if not self.cfg.trt_export_options.use_fp8:
+ G_LOGGER.info("Exporting ONNX with attention_mask")
+ dynamic_axes['attention_mask'] = {2: "sequence", 3: "sequence"}
+
+ self.model.export(
+ onnx_out,
+ onnx_opset_version=self.cfg.onnx_export_options.onnx_opset,
+ do_constant_folding=self.cfg.onnx_export_options.do_constant_folding,
+ dynamic_axes=dynamic_axes,
+ check_trace=check_trace,
+ check_tolerance=self.cfg.onnx_export_options.check_tolerance,
+ verbose=self.cfg.onnx_export_options.verbose,
+ )
+ onnx_names = [augment_filename(onnx_out, subnet_name) for subnet_name in self.model.list_export_subnets()]
+
+ except Exception as e:
+ G_LOGGER.error(
+ "Export failed. Please make sure your NeMo model class ({}) has working export() and that you have the latest NeMo package installed with [all] dependencies.".format(
+ self.model.__class__
+ )
+ )
+ raise e
+
+ release_nemo_model(self.model)
+ assert len(onnx_names) == 1
+ os.rename(onnx_names[0], onnx_out)
+ return onnx_out
+
+ def prune_onnx(self, input_path) -> str:
+ """
+        Prune the input ONNX model to a structured sparsity pattern using Polygraphy.
+ """
+ if not self.cfg.trt_export_options.sparse:
+ G_LOGGER.warning(f"Model pruning is enabled but sparsity is not enabled for TRT engine builder.")
+
+ ibname = os.path.basename(input_path)
+ obname = "pruned." + ibname
+ opath = os.path.join(os.path.dirname(input_path), obname)
+ o_data_real_path = opath + "_data"
+ if os.path.exists(opath) and os.path.exists(o_data_real_path):
+ return opath
+
+ o_data_bname = os.path.basename(o_data_real_path)
+ cmds = f"polygraphy surgeon prune {input_path} -o {opath} --save-external-data {o_data_bname}"
+ G_LOGGER.info(f"Prune ONNX model with: {cmds}")
+ G_LOGGER.info(f"This may take a while...")
+ sp.run(shlex.split(cmds), check=True, stdout=sp.PIPE, stderr=sp.STDOUT)
+ return opath
+
+
+ def create_onnx(self, onnx_input_fpath, onnx_output_fpath, kv_output_policy="kv_new"):
+ """
+ Create an ONNX model with modifications from `onnx_input_fpath`, save the ONNX model to `onnx_output_fpath`.
+        The ONNX model is modified to use a KV-cache and/or to quantize the attention batched matrix-multiplication ops.
+ No return value for this function.
+ """
+ assert os.path.splitext(onnx_input_fpath)[1] == ".onnx", "Input ONNX file must end with '.onnx'."
+ assert os.path.splitext(onnx_output_fpath)[1] == ".onnx", "Output ONNX file must end with '.onnx'."
+
+ quantize_bmms = self.cfg.onnx_export_options.quantize_bmms
+ use_cache = self.cfg.use_cache
+ nbheads, headsize = self.cfg.model.nb_heads, self.cfg.model.head_size
+ hp_dtype = onnx.TensorProto.BFLOAT16 if self.cfg.trt_export_options.use_bf16 else onnx.TensorProto.FLOAT16
+ dtype = hp_dtype
+ if self.cfg.onnx_export_options.use_fp8_storage:
+ dtype = onnx.TensorProto.FLOAT8E4M3FN
+ assert nbheads * headsize == self.cfg.model.hidden_size, "Model hidden size does not match."
+ num_qkvs = process_onnx(kv_output_policy,
+ onnx_input_fpath, onnx_output_fpath, separate_param_files=True,
+ use_cache=use_cache, quantize_bmms=quantize_bmms,
+ nbheads=nbheads, headsize=headsize, vocab_size=self.cfg.model.vocab_size, dtype=dtype, hp_dtype=hp_dtype, use_fp8_storage=self.cfg.onnx_export_options.use_fp8_storage)
+
+ G_LOGGER.info(f"Number of QKV subgraphs = {num_qkvs}, number of layers = {self.cfg.model.num_layers}")
+ if num_qkvs != self.cfg.model.num_layers:
+ raise ValueError("Number of QKV subgraphs must be the same as number of layers in the model.")
+ G_LOGGER.info(f"Saved KV-cache onnx to {onnx_output_fpath}")
+
+
+ # Reads an onnx file and creates a trt engine file
+ def onnx_to_trt(self, onnx_fpath, trt_fpath):
+ """
+ Convert an ONNX model from `onnx_fpath` to a TensorRT engine, and save the result to `trt_fpath`.
+ """
+ # Set up polygraphy config
+ use_tf32 = self.cfg.trt_export_options.use_tf32
+ use_fp16 = self.cfg.trt_export_options.use_fp16
+ use_fp8 = self.cfg.trt_export_options.use_fp8
+ use_bf16 = self.cfg.trt_export_options.use_bf16
+ strongly_typed = self.cfg.trt_export_options.use_strongly_typed
+ sparse = self.cfg.trt_export_options.sparse
+ if sparse and not self.cfg.onnx_export_options.prune:
+ G_LOGGER.warning("Sparsity for TRT engine builder is enabled, but model pruning is not.")
+
+ # Create optimization profiles
+ bs = self.cfg.batch_size
+ max_seq_len = self.cfg.model.max_seq_len
+ opt_seq_len = self.cfg.trt_export_options.opt_seq_len if self.cfg.trt_export_options.opt_seq_len else (max_seq_len // 2)
+ profile_non_kv = Profile()
+ profile_non_kv.add(name="input_ids", min=(bs, 1), opt=(bs, opt_seq_len), max=(bs, max_seq_len)) # (batch, sequence)
+ if not self.cfg.use_one_input:
+ profile_non_kv.add(name="position_ids", min=(bs, 1), opt=(bs, opt_seq_len), max=(bs, max_seq_len)) # (batch, sequence)
+ # For FP8 precision, attention mask is created inside transformer_engine.
+ if not self.cfg.trt_export_options.use_fp8:
+ profile_non_kv.add(name="attention_mask", min=(1, 1, 1, 1), opt=(1, 1, opt_seq_len, opt_seq_len), max=(1, 1, max_seq_len, max_seq_len)) # (1, 1, sequence, sequence)
+
+ num_layers, nbheads, headsize = self.cfg.model.num_layers, self.cfg.model.nb_heads, self.cfg.model.head_size
+ if self.cfg.use_cache:
+ for i in range(num_layers):
+ input_k = get_past_key_name(i)
+ input_v = get_past_value_name(i)
+ # (sequence, batch, nbheads, headsize)
+ profile_non_kv.add(name=input_k, min=(0, bs, nbheads, headsize), opt=(0, bs, nbheads, headsize), max=(0, bs, nbheads, headsize))
+ profile_non_kv.add(name=input_v, min=(0, bs, nbheads, headsize), opt=(0, bs, nbheads, headsize), max=(0, bs, nbheads, headsize))
+
+ profiles = [profile_non_kv]
+
+ # When enabling KV-cache, use first profile for context phase and second profile for generation phase
+ if self.cfg.use_cache:
+ profile_kv = Profile()
+ profile_kv.add(name="input_ids", min=(bs, 1), opt=(bs, 1), max=(bs, 1)) # (batch, sequence)
+ if not self.cfg.use_one_input:
+ profile_kv.add(name="position_ids", min=(bs, 1), opt=(bs, 1), max=(bs, 1)) # (batch, sequence)
+ # For FP8 precision, attention mask is created inside transformer_engine.
+ if not self.cfg.trt_export_options.use_fp8:
+ profile_kv.add(name="attention_mask", min=(1, 1, 1, 1), opt=(1, 1, opt_seq_len, opt_seq_len), max=(1, 1, max_seq_len, max_seq_len)) # (1, 1, sequence, sequence)
+
+ assert num_layers > 0
+ nbheads, headsize = self.cfg.model.nb_heads, self.cfg.model.head_size
+ for i in range(num_layers):
+ input_k = get_past_key_name(i)
+ input_v = get_past_value_name(i)
+ # (sequence, batch, nbheads, headsize)
+ profile_kv.add(name=input_k, min=(1, bs, nbheads, headsize), opt=(opt_seq_len, bs, nbheads, headsize), max=(max_seq_len-1, bs, nbheads, headsize))
+ profile_kv.add(name=input_v, min=(1, bs, nbheads, headsize), opt=(opt_seq_len, bs, nbheads, headsize), max=(max_seq_len-1, bs, nbheads, headsize))
+ profiles = [profile_kv, profile_non_kv]
+
+
+ # Read about these arguments here:
+ # https://github.com/NVIDIA/TensorRT/blob/main/tools/Polygraphy/polygraphy/backend/trt/config.py
+ # Note that the precision args below *enable*, not *require*, the specified precision
+ preview_features = []
+
+ trt_config = CreateConfig(
+ tf32= use_tf32,
+ fp16=False if strongly_typed else use_fp16,
+ bf16=False if strongly_typed else use_bf16,
+ sparse_weights=sparse,
+ profiles=profiles,
+ precision_constraints=None if strongly_typed else "obey",
+ preview_features=preview_features,
+ fp8=False if strongly_typed else use_fp8,
+ load_timing_cache=self.cfg.trt_export_options.timing_cache,
+ )
+
+ # Print out trtexec command for debugging
+ G_LOGGER.debug(" >>> trtexec command for debugging:")
+ G_LOGGER.debug(get_trtexec_cmd(onnx_fpath, self.cfg, bs))
+
+ with PG_LOGGER.verbosity(_calculate_polygraphy_verbosity()):
+ G_LOGGER.info(f"Reading ONNX file at {onnx_fpath}")
+ network = NetworkFromOnnxPath(onnx_fpath, strongly_typed=strongly_typed)
+ G_LOGGER.info("Building TRT engine")
+ engine = engine_from_network(network, config=trt_config)
+ G_LOGGER.info(f"Saving TRT engine to {trt_fpath}")
+ save_engine(engine, trt_fpath)
+
+ @staticmethod
+    def _resolve_opset19_paths(onnx_fpath, results_path: Optional[str] = None) -> Tuple[str, str]:
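+        """Return the output directory (`results_path` when given, otherwise the ONNX file's folder) and the filename of `onnx_fpath`."""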
+ foldername, filename = os.path.split(onnx_fpath)
+        return (foldername if not results_path else results_path), filename
+
+ @staticmethod
+ def get_opset19_onnx_fpath(onnx_fpath, results_path: Optional[str] = None) -> str:
+ suffix = ".opset19.onnx"
+ results_path, filename = NeMoConverter._resolve_opset19_paths(
+ onnx_fpath, results_path
+ )
+ return os.path.join(results_path, os.path.splitext(filename)[0] + suffix)
+
+
+ @staticmethod
+ def onnx_to_opset19(onnx_fpath, results_path: Optional[str] = None) -> str:
+ """
+        Convert an ONNX model `onnx_fpath` to use standard opset-19 Q/DQ nodes, and return the file path
+        of the resulting ONNX model if a conversion is performed, otherwise return `None`.
+ """
+ mappings = replace_customop_qdq_with_onnx_qdq(
+ [onnx_fpath],
+ NeMoConverter._resolve_opset19_paths(onnx_fpath, results_path)[0],
+ create_netron_compatible_model=False,
+ remove_cast_before_q=False,
+ remove_cast_after_dq=False,
+ change_qdq_scale_precision="",
+ )
+ if (
+ (not mappings)
+ or (onnx_fpath not in mappings)
+ or (mappings[onnx_fpath] == None)
+ ):
+ G_LOGGER.error(f"Opset19 onnx file conversion failed for {onnx_fpath}.")
+ assert False
+
+ G_LOGGER.info(f"Converted {onnx_fpath} to {mappings[onnx_fpath]} for opset19.")
+ return mappings[onnx_fpath]
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='NeMo export script arguments', add_help=True)
+ parser.add_argument(
+ "--nemo-model",
+ help="Set a NeMo model to be used.",
+ required=False,
+ default=None,
+ type=str,
+ )
+ parser.add_argument(
+ "--nemo-checkpoint",
+ help="Set a NeMo checkpoint to be used.",
+ required=False,
+ default=None,
+ type=str,
+ )
+ parser.add_argument(
+ "--onnx-model",
+ help="A path to load an ONNX model for conversion.",
+ required=False,
+ default=None,
+ type=str,
+ )
+ parser.add_argument(
+ "--save-onnx-dir",
+ help="A directory to save the generated ONNX model. Must be writable.",
+ required=True,
+ )
+ parser.add_argument(
+ "--opset19",
+ action="store_true",
+ help="If set, the ONNX will be converted to opset19.",
+ default=False
+ )
+ parser.add_argument(
+ "--use-cache",
+ action="store_true",
+ help="If set, the ONNX will have KV-cache inputs and outputs.",
+ default=False
+ )
+ parser.add_argument(
+ "--quantize-bmms",
+ help="Quantize attention BMMs",
+ action="store_true",
+ default=False,
+ )
+ parser.add_argument(
+ "--save-engine",
+ required=False,
+        help="If set to a path, a TensorRT engine will be built from ONNX and saved to the path.",
+ )
+ parser.add_argument(
+ "--fp8",
+ action="store_true",
+ help="Use FP8 precision during conversion.",
+ default=False
+ )
+ parser.add_argument(
+ "--fp16",
+ action="store_true",
+ help="Use FP16 precision during conversion.",
+ default=False
+ )
+ parser.add_argument(
+ "--bf16",
+ action="store_true",
+ help="Use BF16 precision during conversion.",
+ default=False
+ )
+ parser.add_argument(
+ "--extra-configs",
+ required=False,
+        help='Use this flag to set fields specified in config.yaml with a format of --extra-configs="key=value[ key=value]*". Values specified by this flag will not override any value set from other flags.',
+ default=None,
+ type=str,
+ )
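+    # Illustrative example (the keys must already exist in config.yaml; the values shown are placeholders):
+    #   --extra-configs="trt_export_options.opt_seq_len=128 onnx_export_options.prune=true"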
+ args = parser.parse_args()
+ return args
+
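+# Illustrative invocation (paths are placeholders; all flags correspond to parse_args() above):
+#   python3 nemo_export.py --nemo-model /path/to/model.nemo --save-onnx-dir ./onnx_out \
+#       --fp16 --use-cache --save-engine ./gpt.engine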
+def main():
+ G_LOGGER.setLevel(level=G_LOGGER.INFO)
+
+ config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "config.yaml")
+ cfg = omegaconf.OmegaConf.load(config_path)
+ G_LOGGER.info(f"Loaded configs = {cfg}")
+
+ args = parse_args()
+ if (args.nemo_model != None or args.nemo_checkpoint != None) and args.onnx_model != None:
+ G_LOGGER.error("NeMo model and ONNX model cannot be both set.")
+ exit(1)
+
+ if args.nemo_model == None and args.nemo_checkpoint == None and args.onnx_model == None:
+ G_LOGGER.error("Either one of --nemo-model, --nemo-checkpoint, or --onnx-model needs to be set.")
+ exit(1)
+
+ if args.extra_configs != None:
+ kwargs = args.extra_configs.split(" ")
+ for kwarg in kwargs:
+ kw = kwarg.split("=")
+ if len(kw) != 2:
+                raise ValueError(f'Arg {kwarg} is not in the format "key=value"')
+ def nested_set(dic, keys, value):
+ for i in range(len(keys)):
+ if not hasattr(dic, keys[i]):
+ raise ValueError(f"Cannot find key {keys[:i+1]} in the config.")
+ if i == len(keys) - 1:
+ dic[keys[i]] = value
+ else:
+ dic = dic[keys[i]]
+
+ G_LOGGER.info(f"Setting {kw[0]} to {kw[1]}")
+ nested_set(cfg, kw[0].split("."), kw[1])
+ G_LOGGER.info(f"Modified Configs = {cfg}")
+
+ # Set precision for conversion
+ if args.fp16:
+ cfg.trainer.precision = "16"
+ cfg.trt_export_options.use_fp16 = True
+ elif args.bf16:
+ cfg.trainer.precision = "bf16"
+ cfg.trt_export_options.use_bf16 = True
+ else:
+ cfg.trainer.precision = "32"
+
+ if args.fp8:
+ cfg.trt_export_options.use_fp8 = True
+
+ if args.quantize_bmms:
+ cfg.onnx_export_options.quantize_bmms = True
+
+ if os.path.exists(args.save_onnx_dir) and not os.path.isdir(args.save_onnx_dir):
+ raise ValueError(f"{args.save_onnx_dir} is not a directory.")
+
+ cfg.onnx_model_file = os.path.join(args.save_onnx_dir, "model.onnx")
+ create_dir_if_not_exist(cfg.onnx_model_file)
+
+ # Convert NeMo model to ONNX model
+ converter = None
+ if args.nemo_model or args.nemo_checkpoint:
+ cfg.gpt_model_file = args.nemo_model
+ if args.nemo_checkpoint:
+ cfg.checkpoint_dir = os.path.dirname(args.nemo_checkpoint)
+ cfg.checkpoint_name = os.path.basename(args.nemo_checkpoint)
+ converter = NeMoConverter(cfg, MegatronGPTModel)
+ onnx_name = converter.nemo_to_onnx()
+ G_LOGGER.info(f"ONNX exported from NeMo {onnx_name}")
+ elif args.onnx_model:
+ onnx_name = args.onnx_model
+
+ # Convert Q/DQ nodes to use standard opset19 operators
+ if args.opset19:
+ op19_onnx = NeMoConverter.onnx_to_opset19(onnx_name, args.save_onnx_dir)
+ if op19_onnx != None:
+ G_LOGGER.info(f"Get opset19 onnx file {op19_onnx}")
+ onnx_name = op19_onnx
+
+ # Add KV cache to ONNX model
+ if cfg.use_cache:
+ G_LOGGER.info(f"Converting {onnx_name} with KV-cache support")
+ kv_output_policy = "kv_new"
+ new_dir = os.path.join(args.save_onnx_dir, f"{kv_output_policy}")
+ onnx_output_fpath = os.path.join(new_dir, onnx_name.split("/")[-1])
+ create_dir_if_not_exist(onnx_output_fpath)
+ if not converter:
+ converter = NeMoConverter(cfg, MegatronGPTModel)
+ converter.create_onnx(onnx_name, onnx_output_fpath, kv_output_policy)
+ onnx_name = onnx_output_fpath
+
+ if cfg.onnx_export_options.prune:
+ onnx_name = converter.prune_onnx(onnx_name)
+
+ # Convert ONNX model to TRT engine
+ if args.save_engine:
+ create_dir_if_not_exist(args.save_engine)
+ if not converter:
+ converter = NeMoConverter(cfg, MegatronGPTModel)
+ converter.onnx_to_trt(onnx_name, args.save_engine)
+
+if __name__ == '__main__':
+ main()
diff --git a/demo/NeMo/patch_te.sh b/demo/NeMo/patch_te.sh
new file mode 100644
index 00000000..4f060dd8
--- /dev/null
+++ b/demo/NeMo/patch_te.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Sourcing messes up the directory detection with readlink.
+if [ ! "${0##*/}" = "patch_te.sh" ]; then
+ echo "Please run this patch script, don't source it." >&2
+ return 1
+fi
+
+NEMO_DIR=$(dirname "$(readlink -f "$0")")
+
+te_loc="$(pip show transformer_engine | grep '^Location' | awk '{print $2}')"
+cd "${te_loc}/transformer_engine" || {
+ echo "Could not locate transformer-engine python package. Please check if installation proceeded correctly."
+ exit 1
+}
+# Use sys.executable when calling pip within subprocess to recognize virtualenv.
+# If patch is already applied, skip it and proceed with the rest of the script, quit otherwise.
+# NOTE: patch needs to be updated to track the commit of TE in install.sh.
+OUT="$(patch --forward common/__init__.py <"${NEMO_DIR}"/transformer_engine.patch)" || echo "${OUT}" | grep "Skipping patch" -q || {
+ echo "Could not patch transformer engine because ${OUT}"
+ exit 1
+}
+unset OUT
+cd - || exit
+unset te_loc
diff --git a/demo/NeMo/requirements.txt b/demo/NeMo/requirements.txt
new file mode 100644
index 00000000..c715ed76
--- /dev/null
+++ b/demo/NeMo/requirements.txt
@@ -0,0 +1,13 @@
+nemo-toolkit[nlp]==1.17.0
+onnx==1.14.0
+protobuf==3.20.3
+onnxruntime==1.13.1
+transformers==4.27.0
+cuda-python==12.1.0
+setuptools==65.5.1
+tqdm
+--pre --extra-index-url https://download.pytorch.org/whl/cu121
+torch==2.1.0
+torchaudio==2.1.0
+torchvision==0.16.0
+onnx-graphsurgeon==0.3.27
diff --git a/demo/NeMo/run.py b/demo/NeMo/run.py
new file mode 100644
index 00000000..5ba00b5a
--- /dev/null
+++ b/demo/NeMo/run.py
@@ -0,0 +1,200 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Demonstrates TensorRT capabilities with networks trained by NeMo.
+Requires Python 3.7+ (Python 3.8.10 or later is recommended).
+"""
+
+import argparse
+import os
+import sys
+from typing import List, Tuple
+
+ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
+sys.path.append(ROOT_DIR)
+
+sys.path.append('../') # Include the directory one level up to reuse HuggingFace utils.
+from HuggingFace.run import (
+ Action,
+ NetworkScriptAction,
+ WRAPPER_LIST_ACTION,
+)
+from HuggingFace.NNDF.logger import G_LOGGER
+from HuggingFace.NNDF.general_utils import register_network_folders
+from HuggingFace.NNDF.cuda_bootstrapper import bootstrap_ld_library_path
+
+WRAPPER_RUN_ACTION = "run"
+WRAPPER_ACCURACY_ACTION = "accuracy"
+WRAPPER_BENCHMARK_ACTION = "benchmark"
+WRAPPER_ACTIONS = [WRAPPER_LIST_ACTION, WRAPPER_RUN_ACTION, WRAPPER_ACCURACY_ACTION, WRAPPER_BENCHMARK_ACTION]
+
+class ListAction(Action):
+ def __init__(self, networks: List[str], parser: argparse.ArgumentParser):
+ super().__init__(networks, parser)
+ self.networks = networks
+
+ def execute(self, args: argparse.Namespace):
+ print("Networks that are supported by NeMo Demo:")
+ [print(n) for n in self.networks]
+ return 0
+
+class RunAction(NetworkScriptAction):
+ def execute(self, args: argparse.Namespace):
+ module = self.load_script(args.script, args)
+ module.RUN_CMD._parser = self.parser
+
+ old_path = os.getcwd()
+ # Execute script in each relevant folder
+ try:
+ os.chdir(args.network)
+ _ = module.RUN_CMD()
+ finally:
+ os.chdir(old_path)
+
+ return 0
+
+ def add_args(self, parser: argparse.ArgumentParser):
+ super().add_args(parser)
+ run_group = parser.add_argument_group("run args")
+ run_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
+
+class BenchmarkAction(NetworkScriptAction):
+ def execute(self, args: argparse.Namespace):
+ module = self.load_script(args.script, args)
+ module.RUN_CMD._parser = self.parser
+
+ old_path = os.getcwd()
+ # Execute script in each relevant folder
+ try:
+ os.chdir(args.network)
+ _ = module.RUN_CMD()
+ finally:
+ os.chdir(old_path)
+
+ return 0
+
+ def add_args(self, parser: argparse.ArgumentParser):
+ super().add_args(parser)
+ benchmarking_group = parser.add_argument_group("benchmark args")
+ benchmarking_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
+ benchmarking_group.add_argument(
+ "--input-seq-len",
+ type=int,
+ help="Specify fixed input sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
+ )
+ benchmarking_group.add_argument(
+ "--output-seq-len",
+ type=int,
+ help="Specify fixed output sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
+ )
+
+class AccuracyAction(NetworkScriptAction):
+ def execute(self, args: argparse.Namespace):
+ module = self.load_script(args.script, args)
+ module.RUN_CMD._parser = self.parser
+
+ old_path = os.getcwd()
+ # Execute script in each relevant folder
+ try:
+ os.chdir(args.network)
+ _ = module.RUN_CMD()
+ finally:
+ os.chdir(old_path)
+
+ return 0
+
+ def add_args(self, parser: argparse.ArgumentParser):
+ super().add_args(parser)
+ accuracy_group = parser.add_argument_group("accuracy args")
+ accuracy_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
+ accuracy_group.add_argument(
+ "--task",
+ type=str,
+ default="lambada",
+ choices=["lambada"],
+ help="Specify which task to be used for accuracy check.",
+ )
+
+def get_action(
+ action_name: str, networks: List[str], parser: argparse.ArgumentParser
+) -> Action:
+ return {
+ WRAPPER_LIST_ACTION: ListAction,
+ WRAPPER_RUN_ACTION: RunAction,
+ WRAPPER_BENCHMARK_ACTION: BenchmarkAction,
+ WRAPPER_ACCURACY_ACTION: AccuracyAction,
+ }[action_name](networks, parser)
+
+def verify_python_version():
+ if sys.version_info.major < 3 or sys.version_info.minor <= 6:
+ raise RuntimeError("NeMo OSS Demo does not support Python <= 3.6 due to end-of-life.")
+ if sys.version_info.major < 3 or sys.version_info.minor < 8 or (sys.version_info.minor == 8 and sys.version_info.micro < 10):
+ G_LOGGER.warn("NeMo OSS Demo is not tested for Python < 3.8.10")
+
+def get_default_parser(
+ description: str = "", add_default_help=False
+) -> Tuple[argparse.ArgumentParser, bool]:
+ """
+    Returns the argparser used by main(). Allows toggling the default help message with a custom help flag
+    so that argparse does not throw SystemExit when --help is passed in. Useful for custom --help functionality.
+
+ Returns:
+ (argparse.ArgumentParser): argparser used by main()
+ """
+ # This variable is set so that usage errors don't show up in wrapper
+ parser = argparse.ArgumentParser(
+ conflict_handler="resolve",
+ description=description,
+ add_help=add_default_help,
+ prog="run.py",
+ )
+
+ required_group = parser.add_argument_group("required wrapper arguments")
+ required_group.add_argument("action", choices=WRAPPER_ACTIONS)
+ return parser
+
+def main() -> None:
+ """
+    Parses network folders and is responsible for passing --help flags to subcommands if --network is provided.
+ """
+ # Verify python version support
+ verify_python_version()
+
+ # Get all available network scripts
+ networks = register_network_folders(os.getcwd())
+
+ # Add network folder for entry point
+    description = "Runs TensorRT networks that are based off of NeMo variants."
+ parser = get_default_parser(description)
+
+ # Get the general network wrapper help
+ known_args, _ = parser.parse_known_args()
+
+ # Delegate parser to action specifics
+ action = get_action(known_args.action, networks, parser)
+ known_args, _ = parser.parse_known_args()
+
+ # If bootstrap occurs, then the spawned process completes the rest of demo.
+ # We can exit safely. We spawn after parsing basic args to reduce loading churn on rudimentary help commands.
+ if bootstrap_ld_library_path():
+ sys.exit(0)
+
+ return action.execute(known_args)
+
+if __name__ == "__main__":
+ main()
diff --git a/demo/NeMo/transformer_engine.patch b/demo/NeMo/transformer_engine.patch
new file mode 100644
index 00000000..c4c96dea
--- /dev/null
+++ b/demo/NeMo/transformer_engine.patch
@@ -0,0 +1,17 @@
+--- common/__init__.py 2023-06-22 17:22:59.046208583 +0000
++++ common/backup.py 2023-06-22 20:53:01.154819280 +0000
+@@ -7,12 +7,13 @@
+ import os
+ import platform
+ import subprocess
++import sys
+
+
+ def get_te_path():
+ """Find Transformer Engine install path using pip"""
+
+- command = ["pip", "show", "transformer_engine"]
++ command = [sys.executable, "-m", "pip", "show", "transformer_engine"]
+ result = subprocess.run(command, capture_output=True, check=True, text=True)
+ result = result.stdout.replace("\n", ":").split(":")
+ return result[result.index("Location")+1].strip()
diff --git a/demo/Tacotron2/README.md b/demo/Tacotron2/README.md
index db6cbb73..c687c5ee 100644
--- a/demo/Tacotron2/README.md
+++ b/demo/Tacotron2/README.md
@@ -9,11 +9,11 @@ NVIDIA TensorRT is a platform for high-performance deep learning inference. It i
|Software|Version|
|--------|-------|
-|Python|3.6.9|
-|CUDA|11.4.2|
+|Python|3.8.10|
+|CUDA|12.2|
|Apex|0.1|
-|TensorRT|8.2.0.6|
-|PyTorch|1.9.1|
+|TensorRT|9.0|
+|PyTorch|2.0.1|
## Quick Start Guide
@@ -56,7 +56,7 @@ NVIDIA TensorRT is a platform for high-performance deep learning inference. It i
```
The above commands store the generated ONNX files under the `./output/` directory:
- `encoder.onnx`, `decoder_iter.onnx`, `postnet.onnx`, `waveglow.onnx`, and `decoder.onnx` (on TensorRT 8.0+ if `--no-loop` option is not specified).
+ `encoder.onnx`, `decoder_iter.onnx`, `postnet.onnx`, `waveglow.onnx`, `loop_body_fp16.onnx`, and `decoder.onnx` (on TensorRT 8.0+ if `--no-loop` option is not specified).
6. Export the ONNX IRs to TensorRT engines with fp16 mode enabled:
diff --git a/demo/Tacotron2/common/audio_processing.py b/demo/Tacotron2/common/audio_processing.py
index 090581d5..7b261cec 100644
--- a/demo/Tacotron2/common/audio_processing.py
+++ b/demo/Tacotron2/common/audio_processing.py
@@ -64,7 +64,7 @@ def window_sumsquare(window, n_frames, hop_length=200, win_length=800,
# Compute the squared window at the desired length
win_sq = get_window(window, win_length, fftbins=True)
win_sq = librosa_util.normalize(win_sq, norm=norm)**2
- win_sq = librosa_util.pad_center(win_sq, n_fft)
+ win_sq = librosa_util.pad_center(win_sq, size=n_fft)
# Fill the envelope
for i in range(n_frames):
diff --git a/demo/Tacotron2/common/stft.py b/demo/Tacotron2/common/stft.py
index 59700e99..0341d60e 100644
--- a/demo/Tacotron2/common/stft.py
+++ b/demo/Tacotron2/common/stft.py
@@ -81,7 +81,7 @@ def __init__(self, filter_length=800, hop_length=200, win_length=800,
assert(filter_length >= win_length)
# get window and zero center pad it to filter_length
fft_window = get_window(window, win_length, fftbins=True)
- fft_window = pad_center(fft_window, filter_length)
+ fft_window = pad_center(fft_window, size=filter_length)
fft_window = torch.from_numpy(fft_window).float()
# window the bases
diff --git a/demo/Tacotron2/requirements.txt b/demo/Tacotron2/requirements.txt
index 922bb825..b6eb26de 100644
--- a/demo/Tacotron2/requirements.txt
+++ b/demo/Tacotron2/requirements.txt
@@ -1,8 +1,10 @@
-pycuda
+numba>=0.48
+resampy>=0.3.1
+torch==2.0.1
matplotlib
numpy
inflect
-librosa
+librosa>=0.10.0
scipy
Unidecode
git+https://github.com/NVIDIA/dllogger#egg=dllogger
diff --git a/demo/Tacotron2/run_latency_tests.sh b/demo/Tacotron2/run_latency_tests.sh
index a05ef258..85e5f0f8 100644
--- a/demo/Tacotron2/run_latency_tests.sh
+++ b/demo/Tacotron2/run_latency_tests.sh
@@ -1,5 +1,5 @@
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/scripts/download_checkpoints.sh b/demo/Tacotron2/scripts/download_checkpoints.sh
index a7ce499d..0d23f2d3 100755
--- a/demo/Tacotron2/scripts/download_checkpoints.sh
+++ b/demo/Tacotron2/scripts/download_checkpoints.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/scripts/inference_benchmark.sh b/demo/Tacotron2/scripts/inference_benchmark.sh
index 2e0279e4..86200557 100755
--- a/demo/Tacotron2/scripts/inference_benchmark.sh
+++ b/demo/Tacotron2/scripts/inference_benchmark.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,14 +15,7 @@
# limitations under the License.
#
-pip3 install --force-reinstall torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
-
echo "TensorRT BS=1, S=128"
bash test_infer.sh --test tensorrt/test_infer_trt.py -bs 1 -il 128 --fp16 --num-iters 103 --encoder ./output/encoder_fp16.engine --decoder ./output/decoder_with_outer_loop_fp16.engine --postnet ./output/postnet_fp16.engine --waveglow ./output/waveglow_fp16.engine --wn-channels 256
echo "PyTorch (GPU) BS=1, S=128"
bash test_infer.sh -bs 1 -il 128 --fp16 --num-iters 103 --tacotron2 ./checkpoints/tacotron2_pyt_ckpt_amp_v19.09.0/nvidia_tacotron2pyt_fp16_20190427 --waveglow ./checkpoints/waveglow_ckpt_amp_256_v19.10.0/nvidia_waveglow256pyt_fp16 --wn-channels 256
-
-pip3 install torch==1.9.1+cpu torchvision==0.10.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
-
-echo "PyTorch (CPU) BS=1, S=128"
-bash test_infer.sh -bs 1 -il 128 --fp16 --num-iters 5 --tacotron2 ./checkpoints/tacotron2_pyt_ckpt_amp_v19.09.0/nvidia_tacotron2pyt_fp16_20190427 --waveglow ./checkpoints/waveglow_ckpt_amp_256_v19.10.0/nvidia_waveglow256pyt_fp16 --wn-channels 256 --cpu
diff --git a/demo/Tacotron2/scripts/install_prerequisites.sh b/demo/Tacotron2/scripts/install_prerequisites.sh
index 5e5e1f97..5a16d392 100755
--- a/demo/Tacotron2/scripts/install_prerequisites.sh
+++ b/demo/Tacotron2/scripts/install_prerequisites.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,12 +15,11 @@
# limitations under the License.
#
-pip3 install numba==0.48 resampy==0.3.1 torch==1.9.1
pip3 install -r requirements.txt
echo "nvidia" | sudo -S apt-get install -y libsndfile1
pushd /tmp
git clone https://github.com/NVIDIA/apex
cd apex
-pip3 install -v --no-cache-dir ./
+pip3 install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./
popd
diff --git a/demo/Tacotron2/scripts/prepare_dataset.sh b/demo/Tacotron2/scripts/prepare_dataset.sh
index 7d3acb9b..d38be817 100755
--- a/demo/Tacotron2/scripts/prepare_dataset.sh
+++ b/demo/Tacotron2/scripts/prepare_dataset.sh
@@ -1,6 +1,6 @@
#!/usr/bin/env bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/scripts/prepare_mels.sh b/demo/Tacotron2/scripts/prepare_mels.sh
index cb02f775..b3843a26 100644
--- a/demo/Tacotron2/scripts/prepare_mels.sh
+++ b/demo/Tacotron2/scripts/prepare_mels.sh
@@ -1,6 +1,6 @@
#!/usr/bin/env bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/tensorrt/convert_onnx2trt.py b/demo/Tacotron2/tensorrt/convert_onnx2trt.py
index ec43cb05..dd24c801 100644
--- a/demo/Tacotron2/tensorrt/convert_onnx2trt.py
+++ b/demo/Tacotron2/tensorrt/convert_onnx2trt.py
@@ -16,9 +16,6 @@
#
import argparse
-import onnx
-import pycuda.autoinit
-import pycuda.driver as cuda
import sys
import tensorrt as trt
from os.path import join
@@ -62,7 +59,6 @@ def parse_args(parser):
parser.add_argument("-tcf", "--timing-cache-file", default=None, type=str,
help="Path to tensorrt build timeing cache file, only available for tensorrt 8.0 and later. The cache file is assumed to be used exclusively. It's the users' responsibility to create file lock to prevent accessing conflict.",
required=False)
- parser.add_argument("--disable-preview-dynamic-shapes", action="store_true", help="Disable dynamic shape preview feature.")
parser.set_defaults(loop=int(trt.__version__[0]) >= 8)
return parser
@@ -89,10 +85,10 @@ def main():
{"name": "sequence_lengths", "min": (bs_min,), "opt": (bs_opt,), "max": (bs_max,)}]
if args.encoder != "":
print("Building Encoder ...")
- encoder_engine = build_engine(args.encoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ encoder_engine = build_engine(args.encoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if encoder_engine is not None:
with open(encoder_path, 'wb') as f:
- f.write(encoder_engine.serialize())
+ f.write(encoder_engine)
else:
print("Failed to build engine from", args.encoder)
sys.exit(1)
@@ -112,10 +108,10 @@ def main():
{"name": "mask", "min": (bs_min,4), "opt": (bs_opt,128), "max": (bs_max,256)}]
if args.decoder != "":
print("Building Decoder with loop...")
- decoder_engine = build_engine(args.decoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ decoder_engine = build_engine(args.decoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if decoder_engine is not None:
with open(decoder_path, 'wb') as f:
- f.write(decoder_engine.serialize())
+ f.write(decoder_engine)
else:
print("Failed to build engine from", args.decoder)
sys.exit(1)
@@ -134,10 +130,10 @@ def main():
{"name": "mask", "min": (bs_min,4), "opt": (bs_opt,128), "max": (bs_max,256)}]
if args.decoder != "":
print("Building Decoder ...")
- decoder_iter_engine = build_engine(args.decoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ decoder_iter_engine = build_engine(args.decoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if decoder_iter_engine is not None:
with open(decoder_path, 'wb') as f:
- f.write(decoder_iter_engine.serialize())
+ f.write(decoder_iter_engine)
else:
print("Failed to build engine from", args.decoder)
sys.exit(1)
@@ -146,10 +142,10 @@ def main():
shapes=[{"name": "mel_outputs", "min": (bs_min,80,32), "opt": (bs_opt,80,768), "max": (bs_max,80,1664)}]
if args.postnet != "":
print("Building Postnet ...")
- postnet_engine = build_engine(args.postnet, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ postnet_engine = build_engine(args.postnet, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if postnet_engine is not None:
with open(postnet_path, 'wb') as f:
- f.write(postnet_engine.serialize())
+ f.write(postnet_engine)
else:
print("Failed to build engine from", args.postnet)
sys.exit(1)
@@ -159,10 +155,10 @@ def main():
{"name": "z", "min": (bs_min,8,z_min,1), "opt": (bs_opt,8,z_opt,1), "max": (bs_max,8,z_max,1)}]
if args.waveglow != "":
print("Building WaveGlow ...")
- waveglow_engine = build_engine(args.waveglow, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ waveglow_engine = build_engine(args.waveglow, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if waveglow_engine is not None:
with open(waveglow_path, 'wb') as f:
- f.write(waveglow_engine.serialize())
+ f.write(waveglow_engine)
else:
print("Failed to build engine from", args.waveglow)
sys.exit(1)
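
With the updated `build_engine` helper (see `trt_utils.py` later in this diff), the builder returns a serialized plan, so the hunks above write it to disk directly instead of calling `.serialize()`. A hedged sketch of the matching load path, assuming a TensorRT 10 Python install; the file name below is illustrative:

```python
# Sketch only: loading a plan written by the updated convert_onnx2trt.py.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("output/encoder_fp16.engine", "rb") as f:
    plan = f.read()                                # bytes written via f.write(engine)

runtime = trt.Runtime(TRT_LOGGER)
engine = runtime.deserialize_cuda_engine(plan)     # ICudaEngine
context = engine.create_execution_context()

# TRT 10 engines are inspected by tensor name rather than binding index.
names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
print("I/O tensors:", names)
```
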
diff --git a/demo/Tacotron2/tensorrt/inference_trt.py b/demo/Tacotron2/tensorrt/inference_trt.py
index 4f5f76d3..d1a6dabd 100644
--- a/demo/Tacotron2/tensorrt/inference_trt.py
+++ b/demo/Tacotron2/tensorrt/inference_trt.py
@@ -437,7 +437,8 @@ def main():
measurements = {}
sequences, sequence_lengths = prepare_input_sequence(texts)
- sequences = sequences.to(torch.int32)
+ dt = encoder.get_tensor_dtype("sequences")
+ sequences = sequences.to(torch.int64 if dt == trt.DataType.INT64 else torch.int32)
sequence_lengths = sequence_lengths.to(torch.int32)
with MeasureTime(measurements, "latency"):
diff --git a/demo/Tacotron2/tensorrt/run_latency_tests_trt.sh b/demo/Tacotron2/tensorrt/run_latency_tests_trt.sh
index 07dfd704..a289cf63 100644
--- a/demo/Tacotron2/tensorrt/run_latency_tests_trt.sh
+++ b/demo/Tacotron2/tensorrt/run_latency_tests_trt.sh
@@ -1,5 +1,5 @@
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/tensorrt/trt_utils.py b/demo/Tacotron2/tensorrt/trt_utils.py
index 3e1d534a..e150983f 100644
--- a/demo/Tacotron2/tensorrt/trt_utils.py
+++ b/demo/Tacotron2/tensorrt/trt_utils.py
@@ -45,18 +45,18 @@ def is_shape_dynamic(shape):
def run_trt_engine(context, engine, tensors):
- bindings = [None]*engine.num_bindings
- for name,tensor in tensors['inputs'].items():
- idx = engine.get_binding_index(name)
- bindings[idx] = tensor.data_ptr()
- if engine.is_shape_binding(idx) and is_shape_dynamic(context.get_shape(idx)):
- context.set_shape_input(idx, tensor)
- elif is_shape_dynamic(engine.get_binding_shape(idx)):
- context.set_binding_shape(idx, tensor.shape)
-
- for name,tensor in tensors['outputs'].items():
- idx = engine.get_binding_index(name)
- bindings[idx] = tensor.data_ptr()
+ bindings = [0] * engine.num_io_tensors
+
+ for i in range(engine.num_io_tensors):
+ tensor_name = engine.get_tensor_name(i)
+ if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
+ tensor = tensors['inputs'][tensor_name]
+ bindings[i] = tensor.data_ptr()
+ if is_shape_dynamic(engine.get_tensor_shape(tensor_name)):
+ context.set_input_shape(tensor_name, tensor.shape)
+ elif engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.OUTPUT:
+ tensor = tensors['outputs'][tensor_name]
+ bindings[i] = tensor.data_ptr()
context.execute_v2(bindings=bindings)
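
The rewrite above enumerates I/O tensors by name instead of binding index and still dispatches through `execute_v2`. TensorRT 10 also offers an address-based path via `set_tensor_address` and `execute_async_v3`; a sketch of that alternative (not what this patch uses), assuming torch CUDA tensors keyed by tensor name and a stream supplied by the caller:

```python
# Alternative sketch using the address-based execution API (not used by this patch).
# Assumes a TensorRT 10 engine/context, tensors = {'inputs': {...}, 'outputs': {...}}
# holding torch CUDA tensors keyed by tensor name, and a torch.cuda.Stream().
import tensorrt as trt
import torch

def run_trt_engine_async(context, engine, tensors, stream):
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            tensor = tensors['inputs'][name]
            if -1 in tuple(engine.get_tensor_shape(name)):
                context.set_input_shape(name, tensor.shape)  # resolve dynamic dims
        else:
            tensor = tensors['outputs'][name]
        context.set_tensor_address(name, tensor.data_ptr())   # bind by name, not index
    context.execute_async_v3(stream.cuda_stream)
```
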
@@ -84,22 +84,20 @@ def engine_info(engine_filepath):
"DataType.BOOL" : "TYPE_BOOL"}
print("engine name", engine.name)
- print("has_implicit_batch_dimension", engine.has_implicit_batch_dimension)
- start_dim = 0 if engine.has_implicit_batch_dimension else 1
+ start_dim = 1
print("num_optimization_profiles", engine.num_optimization_profiles)
- print("max_batch_size:", engine.max_batch_size)
print("device_memory_size:", engine.device_memory_size)
- print("max_workspace_size:", engine.max_workspace_size)
+ print("max_workspace_size:", engine.get_memory_pool_limit(trt.MemoryPoolType.WORKSPACE))
print("num_layers:", engine.num_layers)
- for i in range(engine.num_bindings):
- btype = "input" if engine.binding_is_input(i) else "output"
- bname = engine.get_binding_name(i)
- dtype = engine.get_binding_dtype(i)
- bdims = engine.get_binding_shape(i)
+ for i in range(engine.num_io_tensors):
+ tensor_name = engine.get_tensor_name(i)
+ btype = "input" if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT else "output"
+ dtype = engine.get_tensor_dtype(tensor_name)
+ bdims = engine.get_tensor_shape(tensor_name)
config_values = {
"btype": btype,
- "bname": bname,
+ "bname": tensor_name,
"dtype": type_mapping[str(dtype)],
"dims": list(bdims[start_dim:])
}
@@ -107,19 +105,15 @@ def engine_info(engine_filepath):
print(final_binding_str)
-def build_engine(model_file, shapes, max_ws=512*1024*1024, fp16=False, timing_cache=None, disable_preview_dynamic_shapes=False):
- if not disable_preview_dynamic_shapes and float(trt.__version__[:3]) < 8.5:
- print("Faster dynamic shapes preview feature is only supported on TRT 8.5+")
- sys.exit(1)
+def build_engine(model_file, shapes, max_ws=512*1024*1024, fp16=False, timing_cache=None):
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
- config.max_workspace_size = max_ws
+ config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, max_ws)
if fp16:
config.flags |= 1 << int(trt.BuilderFlag.FP16)
- config.set_preview_feature(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805, not disable_preview_dynamic_shapes)
profile = builder.create_optimization_profile()
for s in shapes:
profile.set_shape(s['name'], min=s['min'], opt=s['opt'], max=s['max'])
@@ -136,15 +130,17 @@ def build_engine(model_file, shapes, max_ws=512*1024*1024, fp16=False, timing_ca
cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch = False)
- explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
- network = builder.create_network(explicit_batch)
+ network_creation_flag = 0
+ if "EXPLICIT_BATCH" in trt.NetworkDefinitionCreationFlag.__members__.keys():
+ network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+ network = builder.create_network(network_creation_flag)
with trt.OnnxParser(network, TRT_LOGGER) as parser:
with open(model_file, 'rb') as model:
parsed = parser.parse(model.read())
for i in range(parser.num_errors):
print("TensorRT ONNX parser error:", parser.get_error(i))
- engine = builder.build_engine(network, config=config)
+ engine = builder.build_serialized_network(network, config=config)
# save global timing cache
if timing_cache_available:
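
For reference, a usage sketch of the updated `build_engine` signature, with illustrative paths and profile shapes (the real shapes come from `convert_onnx2trt.py` above); it assumes this patched `trt_utils.py` is on the import path:

```python
# Usage sketch for the updated helper (paths and shapes are illustrative).
from trt_utils import build_engine

shapes = [
    {"name": "sequences",        "min": (1, 4), "opt": (1, 128), "max": (1, 256)},
    {"name": "sequence_lengths", "min": (1,),   "opt": (1,),     "max": (1,)},
]

plan = build_engine("output/encoder.onnx", shapes=shapes, fp16=True)
if plan is None:
    raise RuntimeError("engine build failed")

# build_serialized_network() returns an already-serialized plan (IHostMemory),
# so it is written as-is, with no .serialize() call.
with open("output/encoder_fp16.engine", "wb") as f:
    f.write(plan)
```
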
diff --git a/demo/Tacotron2/test_infer.py b/demo/Tacotron2/test_infer.py
index 81254d37..23816da9 100644
--- a/demo/Tacotron2/test_infer.py
+++ b/demo/Tacotron2/test_infer.py
@@ -15,23 +15,16 @@
# limitations under the License.
#
-from tacotron2.text import text_to_sequence
-import models
import torch
import argparse
import numpy as np
from scipy.io.wavfile import write
-import sys
+from inference import MeasureTime, prepare_input_sequence, load_and_setup_model
-from inference import checkpoint_from_distributed, unwrap_distributed, MeasureTime, prepare_input_sequence, load_and_setup_model
-
-import time
import dllogger as DLLogger
from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
-from apex import amp
-
from waveglow.denoiser import Denoiser
def parse_args(parser):
diff --git a/demo/Tacotron2/test_infer.sh b/demo/Tacotron2/test_infer.sh
index fd0e7ecb..103fb941 100644
--- a/demo/Tacotron2/test_infer.sh
+++ b/demo/Tacotron2/test_infer.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/experimental/HuggingFace-Diffusers/README.md b/demo/experimental/HuggingFace-Diffusers/README.md
index 9cf26cfe..d0e4e563 100644
--- a/demo/experimental/HuggingFace-Diffusers/README.md
+++ b/demo/experimental/HuggingFace-Diffusers/README.md
@@ -7,7 +7,7 @@ This demo notebook showcases the acceleration of Stable Diffusion pipeline using
### Clone the TensorRT OSS repository
```bash
-git clone git@github.com:NVIDIA/TensorRT.git -b release/8.6 --single-branch
+git clone git@github.com:NVIDIA/TensorRT.git -b release/9.3 --single-branch
cd TensorRT/demo/experimental/HuggingFace-Diffusers
```
diff --git a/demo/experimental/HuggingFace-Diffusers/TensorRT-diffusers-txt2img.ipynb b/demo/experimental/HuggingFace-Diffusers/TensorRT-diffusers-txt2img.ipynb
index 395a8f27..23eb1492 100644
--- a/demo/experimental/HuggingFace-Diffusers/TensorRT-diffusers-txt2img.ipynb
+++ b/demo/experimental/HuggingFace-Diffusers/TensorRT-diffusers-txt2img.ipynb
@@ -160,7 +160,7 @@
"source": [
"### Install NVIDIA TensorRT\n",
"\n",
- "TensorRT 8.6 includes Stable Diffusion model optimizations out of the box."
+ "TensorRT 8.6+ includes Stable Diffusion model optimizations out of the box."
]
},
{
diff --git a/docker/build.sh b/docker/build.sh
index 6b28fd09..b24029ae 100755
--- a/docker/build.sh
+++ b/docker/build.sh
@@ -42,7 +42,7 @@ then
echo "--cuda not specified, so not passing in --build-arg CUDA_VERSION to Dockerfile"
docker_args="-f $arg_dockerfile --build-arg uid=$(id -u) --build-arg gid=$(id -g) --tag=$arg_imagename ."
else
- docker_args="-f $arg_dockerfile --build-arg CUDA_VERSION=$arg_cudaversion --build-arg uid=$(id -u) --build-arg gid=$(id -g) --tag=$arg_imagename ."
+ docker_args="-f $arg_dockerfile --build-arg CUDA_VERSION=$arg_cudaversion --build-arg CUDA_VERSION_MAJOR_MINOR=${arg_cudaversion:0:4} --build-arg uid=$(id -u) --build-arg gid=$(id -g) --tag=$arg_imagename ."
fi
echo "Building container:"
diff --git a/docker/centos-7.Dockerfile b/docker/centos-7.Dockerfile
deleted file mode 100644
index ff27d6d2..00000000
--- a/docker/centos-7.Dockerfile
+++ /dev/null
@@ -1,105 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-ARG CUDA_VERSION=12.0.1
-
-FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-centos7
-LABEL maintainer="NVIDIA CORPORATION"
-
-ENV TRT_VERSION 8.6.1.6
-SHELL ["/bin/bash", "-c"]
-
-# Setup user account
-ARG uid=1000
-ARG gid=1000
-RUN groupadd -r -f -g ${gid} trtuser && useradd -o -r -l -u ${uid} -g ${gid} -ms /bin/bash trtuser
-RUN usermod -aG wheel trtuser
-RUN echo 'trtuser:nvidia' | chpasswd
-RUN mkdir -p /workspace && chown trtuser /workspace
-
-# Install requried packages
-RUN yum -y groupinstall "Development Tools"
-RUN yum -y install \
- openssl-devel \
- bzip2-devel \
- libffi-devel \
- wget \
- perl-core \
- git \
- pkg-config \
- unzip \
- sudo
-
-# Install python3
-RUN yum install -y python36 python3-devel
-
-# yum needs to use python2
-RUN sed -i "1s/python/python2/" /usr/bin/yum
-
-# Install TensorRT
-RUN if [ "${CUDA_VERSION}" = "10.2" ] ; then \
- v="${TRT_VERSION%.*}-1.cuda${CUDA_VERSION}" &&\
- yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo &&\
- yum -y install libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} \
- libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} \
- python3-libnvinfer-=${v} libnvinfer-dispatch8-=${v} libnvinfer-dispatch-devel-=${v} libnvinfer-lean8-=${v} \
- libnvinfer-lean-devel-=${v} libnvinfer-vc-plugin8-=${v} libnvinfer-vc-plugin-devel-=${v} \
- libnvinfer-headers-devel-=${v} libnvinfer-headers-plugin-devel-=${v}; \
-else \
- v="${TRT_VERSION}-1.cuda${CUDA_VERSION%.*}" &&\
- yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo &&\
- yum -y install libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} \
- libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} \
- python3-libnvinfer-=${v} libnvinfer-dispatch8-=${v} libnvinfer-dispatch-devel-=${v} libnvinfer-lean8-=${v} \
- libnvinfer-lean-devel-=${v} libnvinfer-vc-plugin8-=${v} libnvinfer-vc-plugin-devel-=${v} \
- libnvinfer-headers-devel-=${v} libnvinfer-headers-plugin-devel-=${v}; \
-fi
-
-# Install dev-toolset-8 for g++ version that supports c++14
-RUN yum -y install centos-release-scl
-RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
-RUN yum -y install devtoolset-8
-
-# Install PyPI packages
-RUN pip3 install --upgrade pip
-RUN pip3 install setuptools>=41.0.0
-RUN pip3 install numpy
-RUN pip3 install jupyter jupyterlab
-
-# Install Cmake
-RUN cd /tmp && \
- wget https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh && \
- chmod +x cmake-3.14.4-Linux-x86_64.sh && \
- ./cmake-3.14.4-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir --skip-license && \
- rm ./cmake-3.14.4-Linux-x86_64.sh
-
-# Download NGC client
-RUN cd /usr/local/bin && wget https://ngc.nvidia.com/downloads/ngccli_cat_linux.zip && unzip ngccli_cat_linux.zip && chmod u+x ngc-cli/ngc && rm ngccli_cat_linux.zip ngc-cli.md5 && echo "no-apikey\nascii\n" | ngc-cli/ngc config set
-
-RUN rm /usr/bin/python && ln -s /usr/bin/python3 /usr/bin/python
-
-# Set environment and working directory
-ENV TRT_LIBPATH /usr/lib/x86_64-linux-gnu
-ENV TRT_OSSPATH /workspace/TensorRT
-ENV PATH="${PATH}:/usr/local/bin/ngc-cli"
-ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${TRT_OSSPATH}/build/out:${TRT_LIBPATH}"
-# Use devtoolset-8 as default compiler
-ENV PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"
-WORKDIR /workspace
-
-USER trtuser
-RUN ["/bin/bash"]
diff --git a/docker/ubuntu-20.04-aarch64.Dockerfile b/docker/ubuntu-20.04-aarch64.Dockerfile
deleted file mode 100644
index 540943cd..00000000
--- a/docker/ubuntu-20.04-aarch64.Dockerfile
+++ /dev/null
@@ -1,108 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-ARG CUDA_VERSION=12.0.1
-
-# Multi-arch container support available in non-cudnn containers.
-FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04
-
-ENV TRT_VERSION 8.6.1.6
-SHELL ["/bin/bash", "-c"]
-
-# Setup user account
-ARG uid=1000
-ARG gid=1000
-RUN groupadd -r -f -g ${gid} trtuser && useradd -o -r -l -u ${uid} -g ${gid} -ms /bin/bash trtuser
-RUN usermod -aG sudo trtuser
-RUN echo 'trtuser:nvidia' | chpasswd
-RUN mkdir -p /workspace && chown trtuser /workspace
-
-# Required to build Ubuntu 20.04 without user prompts with DLFW container
-ENV DEBIAN_FRONTEND=noninteractive
-
-# Update CUDA signing key
-RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/3bf863cc.pub
-
-# Install requried libraries
-RUN apt-get update && apt-get install -y software-properties-common
-RUN add-apt-repository ppa:ubuntu-toolchain-r/test
-RUN apt-get update && apt-get install -y --no-install-recommends \
- libcurl4-openssl-dev \
- wget \
- git \
- pkg-config \
- sudo \
- ssh \
- libssl-dev \
- pbzip2 \
- pv \
- bzip2 \
- unzip \
- devscripts \
- lintian \
- fakeroot \
- dh-make \
- build-essential
-
-# Install python3
-RUN apt-get install -y --no-install-recommends \
- python3 \
- python3-pip \
- python3-dev \
- python3-wheel &&\
- cd /usr/local/bin &&\
- ln -s /usr/bin/python3 python &&\
- ln -s /usr/bin/pip3 pip;
-
-# Install TensorRT. This will also pull in CUDNN
-RUN v="${TRT_VERSION}-1+cuda${CUDA_VERSION%.*}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get -y install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v};
-
-# Install Cmake
-RUN cd /tmp && \
- wget https://github.com/Kitware/CMake/releases/download/v3.21.4/cmake-3.21.4-linux-aarch64.sh && \
- chmod +x cmake-3.21.4-linux-aarch64.sh && \
- ./cmake-3.21.4-linux-aarch64.sh --prefix=/usr/local --exclude-subdir --skip-license && \
- rm ./cmake-3.21.4-linux-aarch64.sh
-
-# Install PyPI packages
-RUN pip3 install --upgrade pip
-RUN pip3 install setuptools>=41.0.0
-COPY requirements.txt /tmp/requirements.txt
-RUN pip3 install -r /tmp/requirements.txt
-RUN pip3 install jupyter jupyterlab
-# Workaround to remove numpy installed with tensorflow
-RUN pip3 install --upgrade numpy
-
-# Download NGC client
-RUN cd /usr/local/bin && wget https://ngc.nvidia.com/downloads/ngccli_arm64.zip && unzip ngccli_arm64.zip && chmod u+x ngc-cli/ngc && rm ngccli_arm64.zip ngc-cli.md5 && echo "no-apikey\nascii\n" | ngc-cli/ngc config set
-
-# Set environment and working directory
-ENV TRT_LIBPATH /usr/lib/aarch64-linux-gnu/
-ENV TRT_OSSPATH /workspace/TensorRT
-ENV PATH="${PATH}:/usr/local/bin/ngc-cli"
-ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${TRT_OSSPATH}/build/out:${TRT_LIBPATH}"
-WORKDIR /workspace
-
-USER trtuser
-RUN ["/bin/bash"]
diff --git a/docker/ubuntu-20.04.Dockerfile b/docker/ubuntu-20.04.Dockerfile
index 65605b47..0049d4c2 100644
--- a/docker/ubuntu-20.04.Dockerfile
+++ b/docker/ubuntu-20.04.Dockerfile
@@ -1,5 +1,5 @@
#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -15,14 +15,28 @@
# limitations under the License.
#
-ARG CUDA_VERSION=12.0.1
+ARG CUDA_VERSION=12.3.2
-FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04
+FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04
LABEL maintainer="NVIDIA CORPORATION"
-ENV TRT_VERSION 8.6.1.6
+ENV NV_CUDNN_VERSION 8.9.6.50
+ENV NV_CUDNN_PACKAGE_NAME "libcudnn8"
+
+ENV CUDA_VERSION_MAJOR_MINOR=12.2
+
+ENV NV_CUDNN_PACKAGE "libcudnn8=$NV_CUDNN_VERSION-1+cuda${CUDA_VERSION_MAJOR_MINOR}"
+ENV NV_CUDNN_PACKAGE_DEV "libcudnn8-dev=$NV_CUDNN_VERSION-1+cuda${CUDA_VERSION_MAJOR_MINOR}"
+
+ENV TRT_VERSION 10.0.0.6
SHELL ["/bin/bash", "-c"]
+RUN apt-get update && apt-get install -y --no-install-recommends \
+ ${NV_CUDNN_PACKAGE} \
+ ${NV_CUDNN_PACKAGE_DEV} \
+ && apt-mark hold ${NV_CUDNN_PACKAGE_NAME} \
+ && rm -rf /var/lib/apt/lists/*
+
# Setup user account
ARG uid=1000
ARG gid=1000
@@ -69,24 +83,19 @@ RUN apt-get install -y --no-install-recommends \
ln -s /usr/bin/pip3 pip;
# Install TensorRT
-RUN if [ "${CUDA_VERSION}" = "10.2" ] ; then \
- v="${TRT_VERSION%.*}-1+cuda${CUDA_VERSION}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v}; \
+RUN if [ "${CUDA_VERSION:0:2}" = "11" ]; then \
+ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz \
+ && tar -xf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz \
+ && cp -a TensorRT-10.0.0.6/lib/*.so* /usr/lib/x86_64-linux-gnu \
+ && pip install TensorRT-10.0.0.6/python/tensorrt-10.0.0b6-cp38-none-linux_x86_64.whl ;\
+elif [ "${CUDA_VERSION:0:2}" = "12" ]; then \
+ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz \
+ && tar -xf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz \
+ && cp -a TensorRT-10.0.0.6/lib/*.so* /usr/lib/x86_64-linux-gnu \
+ && pip install TensorRT-10.0.0.6/python/tensorrt-10.0.0b6-cp38-none-linux_x86_64.whl ;\
else \
- v="${TRT_VERSION}-1+cuda${CUDA_VERSION%.*}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get -y install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v}; \
+ echo "Invalid CUDA_VERSION"; \
+ exit 1; \
fi
# Install PyPI packages
diff --git a/docker/ubuntu-18.04.Dockerfile b/docker/ubuntu-22.04.Dockerfile
similarity index 58%
rename from docker/ubuntu-18.04.Dockerfile
rename to docker/ubuntu-22.04.Dockerfile
index 8c246126..ebe90f71 100644
--- a/docker/ubuntu-18.04.Dockerfile
+++ b/docker/ubuntu-22.04.Dockerfile
@@ -1,5 +1,5 @@
#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -15,14 +15,28 @@
# limitations under the License.
#
-ARG CUDA_VERSION=12.0.1
+ARG CUDA_VERSION=12.3.2
-FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu18.04
+FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04
LABEL maintainer="NVIDIA CORPORATION"
-ENV TRT_VERSION 8.6.1.6
+ENV NV_CUDNN_VERSION 8.9.6.50
+ENV NV_CUDNN_PACKAGE_NAME "libcudnn8"
+
+ENV CUDA_VERSION_MAJOR_MINOR=12.2
+
+ENV NV_CUDNN_PACKAGE "libcudnn8=$NV_CUDNN_VERSION-1+cuda${CUDA_VERSION_MAJOR_MINOR}"
+ENV NV_CUDNN_PACKAGE_DEV "libcudnn8-dev=$NV_CUDNN_VERSION-1+cuda${CUDA_VERSION_MAJOR_MINOR}"
+
+ENV TRT_VERSION 10.0.0.6
SHELL ["/bin/bash", "-c"]
+RUN apt-get update && apt-get install -y --no-install-recommends \
+ ${NV_CUDNN_PACKAGE} \
+ ${NV_CUDNN_PACKAGE_DEV} \
+ && apt-mark hold ${NV_CUDNN_PACKAGE_NAME} \
+ && rm -rf /var/lib/apt/lists/*
+
# Setup user account
ARG uid=1000
ARG gid=1000
@@ -31,6 +45,12 @@ RUN usermod -aG sudo trtuser
RUN echo 'trtuser:nvidia' | chpasswd
RUN mkdir -p /workspace && chown trtuser /workspace
+# Required to build Ubuntu 22.04 without user prompts with DLFW container
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Update CUDA signing key
+RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
+
# Install required libraries
RUN apt-get update && apt-get install -y software-properties-common
RUN add-apt-repository ppa:ubuntu-toolchain-r/test
@@ -63,29 +83,26 @@ RUN apt-get install -y --no-install-recommends \
ln -s /usr/bin/pip3 pip;
# Install TensorRT
-RUN if [ "${CUDA_VERSION}" = "10.2" ] ; then \
- v="${TRT_VERSION%.*}-1+cuda${CUDA_VERSION}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v}; \
+RUN if [ "${CUDA_VERSION:0:2}" = "11" ]; then \
+ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz \
+ && tar -xf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz \
+ && cp -a TensorRT-10.0.0.6/lib/*.so* /usr/lib/x86_64-linux-gnu \
+ && pip install TensorRT-10.0.0.6/python/tensorrt-10.0.0b6-cp310-none-linux_x86_64.whl ;\
+elif [ "${CUDA_VERSION:0:2}" = "12" ]; then \
+ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz \
+ && tar -xf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz \
+ && cp -a TensorRT-10.0.0.6/lib/*.so* /usr/lib/x86_64-linux-gnu \
+ && pip install TensorRT-10.0.0.6/python/tensorrt-10.0.0b6-cp310-none-linux_x86_64.whl ;\
else \
- v="${TRT_VERSION}-1+cuda${CUDA_VERSION%.*}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get -y install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v}; \
+ echo "Invalid CUDA_VERSION"; \
+ exit 1; \
fi
# Install PyPI packages
RUN pip3 install --upgrade pip
RUN pip3 install setuptools>=41.0.0
+COPY requirements.txt /tmp/requirements.txt
+RUN pip3 install -r /tmp/requirements.txt
RUN pip3 install jupyter jupyterlab
# Workaround to remove numpy installed with tensorflow
RUN pip3 install --upgrade numpy
diff --git a/docker/ubuntu-cross-aarch64.Dockerfile b/docker/ubuntu-cross-aarch64.Dockerfile
deleted file mode 100644
index cf5f31d9..00000000
--- a/docker/ubuntu-cross-aarch64.Dockerfile
+++ /dev/null
@@ -1,134 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-ARG CUDA_VERSION=11.4.1
-
-# Multi-arch container support available in non-cudnn containers.
-FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04
-LABEL maintainer="NVIDIA CORPORATION"
-
-ENV TRT_VERSION 8.5.2
-ENV DEBIAN_FRONTEND=noninteractive
-
-ARG uid=1000
-ARG gid=1000
-RUN groupadd -r -f -g ${gid} trtuser && useradd -o -r -l -u ${uid} -g ${gid} -ms /bin/bash trtuser
-RUN usermod -aG sudo trtuser
-RUN echo 'trtuser:nvidia' | chpasswd
-RUN mkdir -p /workspace && chown trtuser /workspace
-
-# Install requried libraries
-RUN apt-get update && apt-get install -y software-properties-common
-RUN add-apt-repository ppa:ubuntu-toolchain-r/test
-RUN apt-get update && apt-get install -y --no-install-recommends \
- libcurl4-openssl-dev \
- wget \
- git \
- pkg-config \
- python3 \
- python3-pip \
- python3-dev \
- python3-wheel \
- sudo \
- ssh \
- pbzip2 \
- pv \
- bzip2 \
- unzip \
- build-essential
-
-RUN cd /usr/local/bin &&\
- ln -s /usr/bin/python3 python &&\
- ln -s /usr/bin/pip3 pip
-RUN pip3 install --upgrade pip
-RUN pip3 install setuptools>=41.0.0
-
-# Install Cmake
-RUN cd /tmp && \
- wget https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh && \
- chmod +x cmake-3.14.4-Linux-x86_64.sh && \
- ./cmake-3.14.4-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir --skip-license && \
- rm ./cmake-3.14.4-Linux-x86_64.sh
-
-# Skip installing PyPI packages and NGC client on cross-build container
-
-COPY docker/jetpack_files /pdk_files
-COPY scripts/stubify.sh /pdk_files
-
-# Update CUDA signing keys
-RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
-RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
-
-# Install CUDA cross compile toolchain
-RUN dpkg -i /pdk_files/cuda-repo-cross-aarch64*.deb /pdk_files/cuda-repo-ubuntu*_amd64.deb \
- && cp /var/cuda-repo-cross*/cuda-*-keyring.gpg /usr/share/keyrings/ \
- && cp /var/cuda-repo-ubuntu*/cuda-*-keyring.gpg /usr/share/keyrings/ \
- && apt-get update \
- && apt-get install -y cuda-cross-aarch64 \
- && rm -rf /var/lib/apt/lists/*
-
-# Unpack cudnn
-RUN dpkg -x /pdk_files/cudnn-local-tegra-repo*.deb /pdk_files/cudnn_extract \
- && dpkg -x /pdk_files/cudnn_extract/var/cudnn-local-tegra-repo*/libcudnn[7-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/cudnn \
- && dpkg -x /pdk_files/cudnn_extract/var/cudnn-local-tegra-repo*/libcudnn[7-8]-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/cudnn \
- && cd /pdk_files/cudnn/usr/lib/aarch64-linux-gnu \
- && cd /pdk_files/cudnn \
- && ln -s usr/include/aarch64-linux-gnu include \
- && ln -s usr/lib/aarch64-linux-gnu lib \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_adv_infer_v[7-9].h /usr/include/cudnn_adv_infer.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_adv_train_v[7-9].h /usr/include/cudnn_adv_train.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_backend_v[7-9].h /usr/include/cudnn_backend.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_cnn_infer_v[7-9].h /usr/include/cudnn_cnn_infer.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_cnn_train_v[7-9].h /usr/include/cudnn_cnn_train.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_ops_infer_v[7-9].h /usr/include/cudnn_ops_infer.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_ops_train_v[7-9].h /usr/include/cudnn_ops_train.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_v[7-9].h /usr/include/cudnn.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_version_v[7-9].h /usr/include/cudnn_version.h
-
-# Unpack libnvinfer
-RUN dpkg -x /pdk_files/nv-tensorrt-local-repo-l4t-[0-8].[0-9].[0-9]-cuda-11.[0-9]_*_arm64.deb /pdk_files/tensorrt
-RUN mv /pdk_files/tensorrt/var/nv-tensorrt-local-repo-l4t-[0-8].[0-9].[0-9]-cuda-11.[0-9]/*.deb /pdk_files
-RUN dpkg -x /pdk_files/libnvinfer[0-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvinfer-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvparsers[6-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvparsers-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvinfer-plugin[6-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvinfer-plugin-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvonnxparsers[6-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvonnxparsers-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt
-
-# Clean up debs
-RUN rm -rf /pdk_files/*.deb
-
-# create stub libraries
-RUN cd /pdk_files/tensorrt \
- && ln -s usr/include/aarch64-linux-gnu include \
- && ln -s usr/lib/aarch64-linux-gnu lib \
- && cd lib \
- && mkdir stubs \
- && for x in nvinfer nvparsers nvinfer_plugin nvonnxparser; \
- do \
- CC=aarch64-linux-gnu-gcc /pdk_files/stubify.sh lib${x}.so stubs/lib${x}.so \
- ; done
-
-# Set environment and working directory
-ENV TRT_LIBPATH /pdk_files/tensorrt/lib
-ENV TRT_OSSPATH /workspace/TensorRT
-WORKDIR /workspace
-
-USER trtuser
-RUN ["/bin/bash"]
diff --git a/include/NvCaffeParser.h b/include/NvCaffeParser.h
deleted file mode 100644
index fc91e9b4..00000000
--- a/include/NvCaffeParser.h
+++ /dev/null
@@ -1,263 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef NV_CAFFE_PARSER_H
-#define NV_CAFFE_PARSER_H
-
-#include "NvInfer.h"
-
-//!
-//! \file NvCaffeParser.h
-//!
-//! This is the API for the Caffe Parser
-//!
-
-//!
-//! \namespace nvcaffeparser1
-//!
-//! \brief The TensorRT Caffe parser API namespace.
-//!
-namespace nvcaffeparser1
-{
-
-//!
-//! \class IBlobNameToTensor
-//!
-//! \brief Object used to store and query Tensors after they have been extracted from a Caffe model using the ICaffeParser.
-//!
-//! \note The lifetime of IBlobNameToTensor is the same as the lifetime of its parent ICaffeParser.
-//!
-//! \see nvcaffeparser1::ICaffeParser
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-class IBlobNameToTensor
-{
-public:
- //! \brief Given a blob name, returns a pointer to a ITensor object.
- //!
- //! \param name Caffe blob name for which the user wants the corresponding ITensor.
- //!
- //! \return ITensor* corresponding to the queried name. If no such ITensor exists, then nullptr is returned.
- //!
- virtual nvinfer1::ITensor* find(char const* name) const noexcept = 0;
-
-protected:
- virtual ~IBlobNameToTensor() {}
-};
-
-//!
-//! \class IBinaryProtoBlob
-//!
-//! \brief Object used to store and query data extracted from a binaryproto file using the ICaffeParser.
-//!
-//! \see nvcaffeparser1::ICaffeParser
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-class IBinaryProtoBlob
-{
-public:
- virtual void const* getData() noexcept = 0;
- virtual nvinfer1::Dims4 getDimensions() noexcept = 0;
- virtual nvinfer1::DataType getDataType() noexcept = 0;
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED virtual void destroy() noexcept = 0;
- virtual ~IBinaryProtoBlob() noexcept = default;
-};
-
-//!
-//! \class IPluginFactoryV2
-//!
-//! \brief Plugin factory used to configure plugins.
-//!
-class IPluginFactoryV2
-{
-public:
- //!
- //! \brief A user implemented function that determines if a layer configuration is provided by an IPluginV2.
- //!
- //! \param layerName Name of the layer which the user wishes to validate.
- //!
- virtual bool isPluginV2(char const* layerName) noexcept = 0;
-
- //!
- //! \brief Creates a plugin.
- //!
- //! \param layerName Name of layer associated with the plugin.
- //! \param weights Weights used for the layer.
- //! \param nbWeights Number of weights.
- //! \param libNamespace Library Namespace associated with the plugin object
- //!
- virtual nvinfer1::IPluginV2* createPlugin(char const* layerName, nvinfer1::Weights const* weights,
- int32_t nbWeights, char const* libNamespace = "") noexcept = 0;
-
- virtual ~IPluginFactoryV2() noexcept = default;
-};
-//!
-//! \class ICaffeParser
-//!
-//! \brief Class used for parsing Caffe models.
-//!
-//! Allows users to export models trained using Caffe to TRT.
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-class ICaffeParser
-{
-public:
- //!
- //! \brief Parse a prototxt file and a binaryproto Caffe model to extract
- //! network definition and weights associated with the network, respectively.
- //!
- //! \param deploy The plain text, prototxt file used to define the network definition.
- //! \param model The binaryproto Caffe model that contains the weights associated with the network.
- //! \param network Network in which the CaffeParser will fill the layers.
- //! \param weightType The type to which the weights will transformed.
- //!
- //! \return A pointer to an IBlobNameToTensor object that contains the extracted data.
- //!
- //! \see nvcaffeparser1::IBlobNameToTensor
- //!
- virtual IBlobNameToTensor const* parse(char const* deploy, char const* model, nvinfer1::INetworkDefinition& network,
- nvinfer1::DataType weightType) noexcept = 0;
-
- //!
- //! \brief Parse a deploy prototxt and a binaryproto Caffe model from memory buffers to extract
- //! network definition and weights associated with the network, respectively.
- //!
- //! \param deployBuffer The plain text deploy prototxt used to define the network definition.
- //! \param deployLength The length of the deploy buffer.
- //! \param modelBuffer The binaryproto Caffe memory buffer that contains the weights associated with the network.
- //! \param modelLength The length of the model buffer.
- //! \param network Network in which the CaffeParser will fill the layers.
- //! \param weightType The type to which the weights will transformed.
- //!
- //! \return A pointer to an IBlobNameToTensor object that contains the extracted data.
- //!
- //! \see nvcaffeparser1::IBlobNameToTensor
- //!
- virtual IBlobNameToTensor const* parseBuffers(uint8_t const* deployBuffer, std::size_t deployLength,
- uint8_t const* modelBuffer, std::size_t modelLength, nvinfer1::INetworkDefinition& network,
- nvinfer1::DataType weightType) noexcept = 0;
-
- //!
- //! \brief Parse and extract data stored in binaryproto file.
- //!
- //! The binaryproto file contains data stored in a binary blob. parseBinaryProto() converts it
- //! to an IBinaryProtoBlob object which gives the user access to the data and meta-data about data.
- //!
- //! \param fileName Path to file containing binary proto.
- //!
- //! \return A pointer to an IBinaryProtoBlob object that contains the extracted data.
- //!
- //! \see nvcaffeparser1::IBinaryProtoBlob
- //!
- virtual IBinaryProtoBlob* parseBinaryProto(char const* fileName) noexcept = 0;
-
- //!
- //! \brief Set buffer size for the parsing and storage of the learned model.
- //!
- //! \param size The size of the buffer specified as the number of bytes.
- //!
- //! \note Default size is 2^30 bytes.
- //!
- virtual void setProtobufBufferSize(size_t size) noexcept = 0;
-
- //!
- //! \brief Destroy this ICaffeParser object.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED virtual void destroy() noexcept = 0;
-
- //!
- //! \brief Set the IPluginFactoryV2 used to create the user defined pluginV2 objects.
- //!
- //! \param factory Pointer to an instance of the user implementation of IPluginFactoryV2.
- //!
- virtual void setPluginFactoryV2(IPluginFactoryV2* factory) noexcept = 0;
-
- //!
- //! \brief Set the namespace used to lookup and create plugins in the network.
- //!
- virtual void setPluginNamespace(char const* libNamespace) noexcept = 0;
-
- virtual ~ICaffeParser() noexcept = default;
-
-public:
- //!
- //! \brief Set the ErrorRecorder for this interface
- //!
- //! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
- //!
- //! If an error recorder is not set, messages will be sent to the global log stream.
- //!
- //! \param recorder The error recorder to register with this interface.
- //!
- //! \see getErrorRecorder()
- //!
- virtual void setErrorRecorder(nvinfer1::IErrorRecorder* recorder) noexcept = 0;
-
- //!
- //! \brief get the ErrorRecorder assigned to this interface.
- //!
- //! Retrieves the assigned error recorder object for the given class. A
- //! nullptr will be returned if setErrorRecorder has not been called.
- //!
- //! \return A pointer to the IErrorRecorder object that has been registered.
- //!
- //! \see setErrorRecorder()
- //!
- virtual nvinfer1::IErrorRecorder* getErrorRecorder() const noexcept = 0;
-};
-
-//!
-//! \brief Creates a ICaffeParser object.
-//!
-//! \return A pointer to the ICaffeParser object is returned.
-//!
-//! \see nvcaffeparser1::ICaffeParser
-//!
-//! \deprecated ICaffeParser will be removed in TensorRT 9.0. Plan to migrate your workflow to
-//! use nvonnxparser::IParser for deployment.
-//!
-TENSORRTAPI ICaffeParser* createCaffeParser() noexcept;
-
-//!
-//! \brief Shuts down protocol buffers library.
-//!
-//! \note No part of the protocol buffers library can be used after this function is called.
-//!
-TENSORRTAPI void shutdownProtobufLibrary() noexcept;
-} // namespace nvcaffeparser1
-
-//!
-//! Internal C entry point for creating ICaffeParser.
-//! @private
-//!
-extern "C" TENSORRTAPI void* createNvCaffeParser_INTERNAL() noexcept;
-#endif
diff --git a/include/NvInfer.h b/include/NvInfer.h
index 63c0b7f8..7fff86b1 100644
--- a/include/NvInfer.h
+++ b/include/NvInfer.h
@@ -57,7 +57,7 @@ namespace nvinfer1
enum class LayerType : int32_t
{
kCONVOLUTION = 0, //!< Convolution layer.
- kFULLY_CONNECTED = 1, //!< Fully connected layer.
+    kCAST = 1,             //!< Cast layer.
kACTIVATION = 2, //!< Activation layer.
kPOOLING = 3, //!< Pooling layer.
kLRN = 4, //!< LRN layer.
@@ -76,34 +76,33 @@ enum class LayerType : int32_t
kMATRIX_MULTIPLY = 17, //!< Matrix multiply layer.
kRAGGED_SOFTMAX = 18, //!< Ragged softmax layer.
kCONSTANT = 19, //!< Constant layer.
- kRNN_V2 = 20, //!< RNNv2 layer.
- kIDENTITY = 21, //!< Identity layer.
- kPLUGIN_V2 = 22, //!< PluginV2 layer.
- kSLICE = 23, //!< Slice layer.
- kSHAPE = 24, //!< Shape layer.
- kPARAMETRIC_RELU = 25, //!< Parametric ReLU layer.
- kRESIZE = 26, //!< Resize Layer.
- kTRIP_LIMIT = 27, //!< Loop Trip limit layer
- kRECURRENCE = 28, //!< Loop Recurrence layer
- kITERATOR = 29, //!< Loop Iterator layer
- kLOOP_OUTPUT = 30, //!< Loop output layer
- kSELECT = 31, //!< Select layer.
- kFILL = 32, //!< Fill layer
- kQUANTIZE = 33, //!< Quantize layer
- kDEQUANTIZE = 34, //!< Dequantize layer
- kCONDITION = 35, //!< Condition layer
- kCONDITIONAL_INPUT = 36, //!< Conditional Input layer
- kCONDITIONAL_OUTPUT = 37, //!< Conditional Output layer
- kSCATTER = 38, //!< Scatter layer
- kEINSUM = 39, //!< Einsum layer
- kASSERTION = 40, //!< Assertion layer
- kONE_HOT = 41, //!< OneHot layer
- kNON_ZERO = 42, //!< NonZero layer
- kGRID_SAMPLE = 43, //!< Grid sample layer
- kNMS = 44, //!< NMS layer
- kREVERSE_SEQUENCE = 45, //!< Reverse sequence layer
- kNORMALIZATION = 46, //!< Normalization layer
- kCAST = 47, //!< Cast layer
+ kIDENTITY = 20, //!< Identity layer.
+ kPLUGIN_V2 = 21, //!< PluginV2 layer.
+ kSLICE = 22, //!< Slice layer.
+ kSHAPE = 23, //!< Shape layer.
+ kPARAMETRIC_RELU = 24, //!< Parametric ReLU layer.
+ kRESIZE = 25, //!< Resize Layer.
+ kTRIP_LIMIT = 26, //!< Loop Trip limit layer
+ kRECURRENCE = 27, //!< Loop Recurrence layer
+ kITERATOR = 28, //!< Loop Iterator layer
+ kLOOP_OUTPUT = 29, //!< Loop output layer
+ kSELECT = 30, //!< Select layer.
+ kFILL = 31, //!< Fill layer
+ kQUANTIZE = 32, //!< Quantize layer
+ kDEQUANTIZE = 33, //!< Dequantize layer
+ kCONDITION = 34, //!< Condition layer
+ kCONDITIONAL_INPUT = 35, //!< Conditional Input layer
+ kCONDITIONAL_OUTPUT = 36, //!< Conditional Output layer
+ kSCATTER = 37, //!< Scatter layer
+ kEINSUM = 38, //!< Einsum layer
+ kASSERTION = 39, //!< Assertion layer
+ kONE_HOT = 40, //!< OneHot layer
+ kNON_ZERO = 41, //!< NonZero layer
+ kGRID_SAMPLE = 42, //!< Grid sample layer
+ kNMS = 43, //!< NMS layer
+ kREVERSE_SEQUENCE = 44, //!< Reverse sequence layer
+ kNORMALIZATION = 45, //!< Normalization layer
+ kPLUGIN_V3 = 46 //!< PluginV3 layer.
};
//!
@@ -114,7 +113,7 @@ enum class LayerType : int32_t
template <>
constexpr inline int32_t EnumMax() noexcept
{
- return 48;
+ return 47;
}
//!
@@ -132,18 +131,20 @@ using TensorFormats = uint32_t;
//!
enum class ActivationType : int32_t
{
- kRELU = 0, //!< Rectified linear activation.
- kSIGMOID = 1, //!< Sigmoid activation.
- kTANH = 2, //!< TanH activation.
- kLEAKY_RELU = 3, //!< LeakyRelu activation: x>=0 ? x : alpha * x.
- kELU = 4, //!< Elu activation: x>=0 ? x : alpha * (exp(x) - 1).
- kSELU = 5, //!< Selu activation: x>0 ? beta * x : beta * (alpha*exp(x) - alpha)
- kSOFTSIGN = 6, //!< Softsign activation: x / (1+|x|)
- kSOFTPLUS = 7, //!< Parametric softplus activation: alpha*log(exp(beta*x)+1)
- kCLIP = 8, //!< Clip activation: max(alpha, min(beta, x))
- kHARD_SIGMOID = 9, //!< Hard sigmoid activation: max(0, min(1, alpha*x+beta))
- kSCALED_TANH = 10, //!< Scaled tanh activation: alpha*tanh(beta*x)
- kTHRESHOLDED_RELU = 11 //!< Thresholded ReLU activation: x>alpha ? x : 0
+ kRELU = 0, //!< Rectified linear activation.
+ kSIGMOID = 1, //!< Sigmoid activation.
+ kTANH = 2, //!< TanH activation.
+ kLEAKY_RELU = 3, //!< LeakyRelu activation: x>=0 ? x : alpha * x.
+ kELU = 4, //!< Elu activation: x>=0 ? x : alpha * (exp(x) - 1).
+ kSELU = 5, //!< Selu activation: x>0 ? beta * x : beta * (alpha*exp(x) - alpha)
+ kSOFTSIGN = 6, //!< Softsign activation: x / (1+|x|)
+ kSOFTPLUS = 7, //!< Parametric softplus activation: alpha*log(exp(beta*x)+1)
+ kCLIP = 8, //!< Clip activation: max(alpha, min(beta, x))
+ kHARD_SIGMOID = 9, //!< Hard sigmoid activation: max(0, min(1, alpha*x+beta))
+ kSCALED_TANH = 10, //!< Scaled tanh activation: alpha*tanh(beta*x)
+ kTHRESHOLDED_RELU = 11, //!< Thresholded ReLU activation: x>alpha ? x : 0
+ kGELU_ERF = 12, //!< GELU erf activation: 0.5 * x * (1 + erf(sqrt(0.5) * x))
+ kGELU_TANH = 13 //!< GELU tanh activation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (0.044715F * pow(x, 3) + x)))
};
namespace impl
@@ -156,7 +157,7 @@ namespace impl
template <>
struct EnumMaxImpl
{
- static constexpr int32_t kVALUE = 12;
+ static constexpr int32_t kVALUE = 14;
};
} // namespace impl
@@ -224,7 +225,7 @@ class ITensor : public INoCopy
//!
//! \see getDimensions()
//!
- void setDimensions(Dims dimensions) noexcept
+ void setDimensions(Dims const& dimensions) noexcept
{
mImpl->setDimensions(dimensions);
}
@@ -235,6 +236,7 @@ class ITensor : public INoCopy
//! \return The dimensions of the tensor.
//!
//! \warning getDimensions() returns a -1 for dimensions that are derived from a wildcard dimension.
+ //!
//! \see setDimensions()
//!
Dims getDimensions() const noexcept
@@ -301,46 +303,41 @@ class ITensor : public INoCopy
}
//!
- //! \brief Set whether to enable broadcast of tensor across the batch.
- //!
- //! When a tensor is broadcast across a batch, it has the same value for every member in the batch.
- //! Memory is only allocated once for the single member.
- //!
- //! This method is only valid for network input tensors, since the flags of layer output tensors are inferred based
- //! on layer inputs and parameters.
- //! If this state is modified for a tensor in the network, the states of all dependent tensors will be recomputed.
- //! If the tensor is for an explicit batch network, then this function does nothing.
+ //! \brief Set whether to enable broadcast of tensor across the implicit batch dimension.
//!
- //! \warning The broadcast flag is ignored when using explicit batch network mode.
+ //! \warning This method has no effect other than issuing a warning.
//!
- //! \param broadcastAcrossBatch Whether to enable broadcast of tensor across the batch.
+ //! \param broadcastAcrossBatch Whether to broadcast the tensor across the implicit
+ //! batch dimension that was a feature of TensorRT 9.x and prior.
//!
//! \see getBroadcastAcrossBatch()
//!
- void setBroadcastAcrossBatch(bool broadcastAcrossBatch) noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch is not supported since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED void setBroadcastAcrossBatch(bool broadcastAcrossBatch) noexcept
{
mImpl->setBroadcastAcrossBatch(broadcastAcrossBatch);
}
//!
- //! \brief Check if tensor is broadcast across the batch.
- //!
- //! When a tensor is broadcast across a batch, it has the same value for every member in the batch.
- //! Memory is only allocated once for the single member. If the network is in explicit batch mode,
- //! this function returns true if the leading dimension is 1.
+ //! \brief Check if tensor is broadcast across the implicit batch dimension.
//!
- //! \return True if tensor is broadcast across the batch, false otherwise.
+ //! \return Always false since TensorRT 10.0 does not support an implicit batch dimension.
//!
//! \see setBroadcastAcrossBatch()
//!
- bool getBroadcastAcrossBatch() const noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch is not supported since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool getBroadcastAcrossBatch() const noexcept
{
return mImpl->getBroadcastAcrossBatch();
}
//!
//! \brief Get the storage location of a tensor.
+ //!
//! \return The location of tensor data.
+ //!
//! \see setLocation()
//!
TensorLocation getLocation() const noexcept
@@ -350,6 +347,7 @@ class ITensor : public INoCopy
//!
//! \brief Set the storage location of a tensor
+ //!
//! \param location the location of tensor data
//!
//! Only network input tensors for storing sequence lengths for RNNv2 are supported.
@@ -358,7 +356,10 @@ class ITensor : public INoCopy
//!
//! \see getLocation()
//!
- void setLocation(TensorLocation location) noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. RNNv2 is not supported and the location must
+ //! always be TensorLocation::kDEVICE since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED void setLocation(TensorLocation location) noexcept
{
mImpl->setLocation(location);
}
@@ -403,7 +404,7 @@ class ITensor : public INoCopy
//!
//! \brief Set allowed formats for this tensor. By default all formats are allowed.
- //! Shape tensors (for which isShapeTensor() returns true) may only have row major linear format.
+ //! Shape tensors (for which isShapeTensor() returns true) may only have row-major linear format.
//!
//! When running network on DLA and the build option kGPU_FALLBACK is not specified, if DLA format(kCHW4 with Int8,
//! kCHW4 with FP16, kCHW16 with FP16, kCHW32 with Int8) is set, the input format is treated as native DLA format with
@@ -413,6 +414,7 @@ class ITensor : public INoCopy
//! \param formats A bitmask of TensorFormat values that are supported for this tensor.
//!
//! \see ITensor::getAllowedFormats()
+ //!
//! \see TensorFormats
//!
void setAllowedFormats(TensorFormats formats) noexcept
@@ -422,7 +424,7 @@ class ITensor : public INoCopy
//!
//! \brief Get a bitmask of TensorFormat values that the tensor supports.
- //! For a shape tensor, only row major linear format is allowed.
+ //! For a shape tensor, only row-major linear format is allowed.
//!
//! \return The value specified by setAllowedFormats or all possible formats.
//!
@@ -437,7 +439,7 @@ class ITensor : public INoCopy
//! \brief Whether the tensor is a shape tensor.
//!
//! A shape tensor is a tensor that is related to shape calculations.
- //! It must have type Int32, Bool, or Float, and its shape must be determinable at build time.
+ //! It must have type Int32, Int64, Bool, or Float, and its shape must be determinable at build time.
//! Furthermore, it must be needed as a shape tensor, either marked as a network shape
//! output via markOutputForShapes(), or as a layer input that is required to be a shape
//! tensor, such as the second input to IShuffleLayer. Some layers are "polymorphic" in
@@ -453,15 +455,11 @@ class ITensor : public INoCopy
//! cause all three tensors to be shape tensors, because IShuffleLayer requires that its
//! second optional input be a shape tensor, and IElementWiseLayer is "polymorphic".
//!
- //! If a tensor is a shape tensor and becomes an engine input or output,
- //! then ICudaEngine::isShapeBinding will be true for that tensor.
- //! Such a shape tensor must have type Int32.
- //!
//! It is possible for a tensor to be both a shape tensor and an execution tensor.
//!
//! \return True if tensor is a shape tensor, false otherwise.
//!
- //! \see INetworkDefinition::markOutputForShapes(), ICudaEngine::isShapeBinding()
+ //! \see INetworkDefinition::markOutputForShapes()
//!
bool isShapeTensor() const noexcept
{
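As a sketch of the IShuffleLayer case described above (assuming `data` and `newShape` are `ITensor*` already in the network, with `newShape` a 1-D Int32/Int64 tensor computable at build time):

    // Supplying a tensor as the second (reshape-dimensions) input of a shuffle layer makes it a
    // shape tensor, so isShapeTensor() will return true for newShape and tensors it depends on.
    nvinfer1::IShuffleLayer* shuffle = network->addShuffle(*data);
    shuffle->setInput(1, *newShape);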
@@ -478,8 +476,6 @@ class ITensor : public INoCopy
//! For example, if a partially built network has no path from a tensor to a network output,
//! isExecutionTensor() returns false. Completing the path would cause it to become true.
//!
- //! If a tensor is an execution tensor and becomes an engine input or output,
- //! then ICudaEngine::isExecutionBinding will be true for that tensor.
//!
//! A tensor with isShapeTensor() == false and isExecutionTensor() == false
//! can still show up as an input to the engine if its dimensions are required.
@@ -595,7 +591,7 @@ class ILayer : public INoCopy
//! \param index The index of the input tensor.
//!
//! \return The input tensor, or nullptr if the index is out of range or the tensor is optional
- //! (\ref ISliceLayer and \ref IRNNv2Layer).
+ //! (\ref ISliceLayer).
//!
ITensor* getInput(int32_t index) const noexcept
{
@@ -613,8 +609,7 @@ class ILayer : public INoCopy
//!
//! \brief Get the layer output corresponding to the given index.
//!
- //! \return The indexed output tensor, or nullptr if the index is out of range or the tensor is optional
- //! (\ref IRNNv2Layer).
+ //! \return The indexed output tensor, or nullptr if the index is out of range or the tensor is optional.
//!
ITensor* getOutput(int32_t index) const noexcept
{
@@ -639,9 +634,9 @@ class ILayer : public INoCopy
}
//!
- //! \brief Set the computational precision of this layer
+ //! \brief Set the preferred or required computational precision of this layer in a weakly-typed network.
//!
- //! Setting the precision allows TensorRT to choose an implementation which run at this computational precision.
+ //! Setting the precision directs TensorRT to choose an implementation that runs at this computational precision.
//! TensorRT could still choose a non-conforming fastest implementation that ignores the requested precision.
//! To force choosing an implementation with the requested precision, set exactly one of the following flags,
//! which differ in what happens if no such implementation exists:
@@ -657,6 +652,10 @@ class ILayer : public INoCopy
    //! For an IIdentityLayer: If it casts to/from float/half/int8/uint8, the precision must be one of those types;
    //! otherwise it must be either the input or output type.
//!
+ //! Strongly-typed networks reject calls to method setPrecision. In strongly-typed networks, the computation
+ //! precision is typically controlled by casting the input tensors to the desired type. The exception is
+ //! INormalizationLayer, which has a method setComputePrecision().
+ //!
//! \param dataType the computational precision.
//!
//! \see getPrecision() precisionIsSet() resetPrecision()
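For example, a minimal weakly-typed-network sketch (assuming `layer` is an `ILayer*` and `config` is an `IBuilderConfig*`):

    // Request FP16 computation for this layer and make the constraint binding at build time.
    layer->setPrecision(nvinfer1::DataType::kHALF);
    config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);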
@@ -701,12 +700,13 @@ class ILayer : public INoCopy
}
//!
- //! \brief Set the output type of this layer
+ //! \brief Set the output type of this layer in a weakly-typed network.
//!
//! Setting the output type constrains TensorRT to choose implementations which generate output data with the
//! given type. If it is not set, TensorRT will select output type based on layer computational precision. TensorRT
//! could still choose non-conforming output type based on fastest implementation. To force choosing the requested
- //! output type, set exactly one of the following flags, which differ in what happens if no such implementation exists:
+ //! output type, set exactly one of the following flags, which differ in what happens if no such implementation
+ //! exists:
//!
//! * BuilderFlag::kOBEY_PRECISION_CONSTRAINTS - build fails with an error message.
//!
@@ -728,6 +728,14 @@ class ILayer : public INoCopy
//! is marked as a network output, since only setType() [but not setOutputType()] will affect the data
//! representation in the corresponding output binding.
//!
+ //! Strongly-typed networks reject calls to method setOutputType. Instead, the output type can be set
+ //! only for layers that define method setToType(). Those layers are:
+ //!
+ //! * ICastLayer
+ //! * IDequantizeLayer
+ //! * IFillLayer
+ //! * IQuantizeLayer
+ //!
//! \param index the index of the output to set
//! \param dataType the type of the output
//!
@@ -742,6 +750,7 @@ class ILayer : public INoCopy
//! \brief get the output type of this layer
//!
//! \param index the index of the output
+ //!
//! \return the output precision. If no precision has been set, DataType::kFLOAT will be returned,
//! unless the output type is inherently DataType::kINT32.
//!
@@ -756,6 +765,7 @@ class ILayer : public INoCopy
//! \brief whether the output type has been set for this layer
//!
//! \param index the index of the output
+ //!
//! \return whether the output type has been explicitly set
//!
//! \see setOutputType() getOutputType() resetOutputType()
@@ -819,8 +829,8 @@ class ILayer : public INoCopy
//! \brief Enumerates the modes of padding to perform in convolution, deconvolution and pooling layers;
//! the padding mode takes precedence if setPaddingMode() and setPrePadding() are also used.
//!
-//! There are three padding styles, EXPLICIT, SAME, and CAFFE, with each style having two variants.
-//! The EXPLICIT and CAFFE styles determine if the final sampling location is used or not.
+//! There are two padding styles, EXPLICIT and SAME, with each style having two variants.
+//! The EXPLICIT style determines if the final sampling location is used or not.
//! The SAME style determines if the asymmetry in the padding is on the pre or post padding.
//!
//! \code
@@ -842,18 +852,10 @@ class ILayer : public INoCopy
//! \code
//! O = floor((M - DK) / S) + 1
//! \endcode
-//! - CAFFE_ROUND_DOWN:
-//! \code
-//! O = floor((I + B * 2 - DK) / S) + 1
-//! \endcode
//! - EXPLICIT_ROUND_UP:
//! \code
//! O = ceil((M - DK) / S) + 1
//! \endcode
-//! - CAFFE_ROUND_UP:
-//! \code
-//! O = ceil((I + B * 2 - DK) / S) + 1
-//! \endcode
//! - SAME_UPPER:
//! \code
//! O = ceil(I / S)
@@ -871,9 +873,7 @@ class ILayer : public INoCopy
//!
//! Formulas for Deconvolution:
//! - EXPLICIT_ROUND_DOWN:
-//! - CAFFE_ROUND_DOWN:
//! - EXPLICIT_ROUND_UP:
-//! - CAFFE_ROUND_UP:
//! \code
//! O = (I - 1) * S + DK - (B + A)
//! \endcode
@@ -915,14 +915,6 @@ class ILayer : public INoCopy
//! A = floor(P / 2)
//! B = P - A
//! \endcode
-//! - CAFFE_ROUND_DOWN:
-//! \code
-//! EXPLICIT_ROUND_DOWN - ((EXPLICIT_ROUND_DOWN - 1) * S >= I + B)
-//! \endcode
-//! - CAFFE_ROUND_UP:
-//! \code
-//! EXPLICIT_ROUND_UP - ((EXPLICIT_ROUND_UP - 1) * S >= I + B)
-//! \endcode
//!
//! Pooling Example 1:
//! \code
@@ -987,62 +979,12 @@ class ILayer : public INoCopy
//! Given I = {6, 6}, B = {3, 3}, A = {3, 3}, S = {2, 2}, F = {3, 3}. What is O?
//! \endcode
//!
-//! - CAFFE_ROUND_DOWN:
-//! \code
-//! Computation:
-//! M = {6, 6} + {3, 3} + {3, 3} ==> {12, 12}
-//! EXPLICIT_ROUND_DOWN ==> floor((M - F) / S) + 1
-//! ==> floor(({12, 12} - {3, 3}) / {2, 2}) + {1, 1}
-//! ==> {5, 5}
-//! DIFF = (((EXPLICIT_ROUND_DOWN - 1) * S >= I + B) ? {1, 1} : {0, 0})
-//! ==> ({5, 5} - {1, 1}) * {2, 2} >= {6, 6} + {3, 3} ? {1, 1} : {0,0}
-//! ==> {0, 0}
-//! O ==> EXPLICIT_ROUND_DOWN - DIFF
-//! ==> {5, 5} - {0, 0}
-//! ==> {5, 5}
-//! \endcode
-//! - CAFFE_ROUND_UP:
-//! \code
-//! Computation:
-//! M = {6, 6} + {3, 3} + {3, 3} ==> {12, 12}
-//! EXPLICIT_ROUND_UP ==> ceil((M - F) / S) + 1
-//! ==> ceil(({12, 12} - {3, 3}) / {2, 2}) + {1, 1}
-//! ==> {6, 6}
-//! DIFF = (((EXPLICIT_ROUND_UP - 1) * S >= I + B) ? {1, 1} : {0, 0})
-//! ==> ({6, 6} - {1, 1}) * {2, 2} >= {6, 6} + {3, 3} ? {1, 1} : {0,0}
-//! ==> {1, 1}
-//! O ==> EXPLICIT_ROUND_UP - DIFF
-//! ==> {6, 6} - {1, 1}
-//! ==> {5, 5}
-//! \endcode
-//!
-//! The sample points are {0, 2, 4, 6, 8} in each dimension.
-//! CAFFE_ROUND_DOWN and CAFFE_ROUND_UP have two restrictions each on usage with pooling operations.
-//! This will cause getDimensions to return an empty dimension and also to reject the network
-//! at validation time.
-//! For more information on original reference code, see
-//! https://github.com/BVLC/caffe/blob/master/src/caffe/layers/pooling_layer.cpp
-//!
-//! - Restriction 1:
-//! \code
-//! CAFFE_ROUND_DOWN: B >= F is an error if (B - S) < F
-//! CAFFE_ROUND_UP: (B + S) >= (F + 1) is an error if B < (F + 1)
-//! \endcode
-//!
-//! - Restriction 2:
-//! \code
-//! CAFFE_ROUND_DOWN: (B - S) >= F is an error if B >= F
-//! CAFFE_ROUND_UP: B >= (F + 1) is an error if (B + S) >= (F + 1)
-//! \endcode
-//!
enum class PaddingMode : int32_t
{
kEXPLICIT_ROUND_DOWN = 0, //!< Use explicit padding, rounding output size down.
kEXPLICIT_ROUND_UP = 1, //!< Use explicit padding, rounding output size up.
kSAME_UPPER = 2, //!< Use SAME padding, with prePadding <= postPadding.
kSAME_LOWER = 3, //!< Use SAME padding, with prePadding >= postPadding.
- kCAFFE_ROUND_DOWN = 4, //!< Use CAFFE padding, rounding output size down, uses prePadding value.
- kCAFFE_ROUND_UP = 5 //!< Use CAFFE padding, rounding output size up, uses prePadding value.
};
namespace impl
@@ -1055,7 +997,7 @@ namespace impl
template <>
struct EnumMaxImpl
{
- static constexpr int32_t kVALUE = 6;
+ static constexpr int32_t kVALUE = 4;
};
} // namespace impl
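For instance, a minimal sketch of selecting a padding mode (assuming `conv` is an `IConvolutionLayer*`):

    // SAME_UPPER: output spatial size is ceil(I / S) regardless of kernel size,
    // e.g. I = 6, S = 2 gives O = 3; any asymmetric padding goes to the post-padding side.
    conv->setPaddingMode(nvinfer1::PaddingMode::kSAME_UPPER);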
@@ -1074,32 +1016,6 @@ struct EnumMaxImpl
class IConvolutionLayer : public ILayer
{
public:
- //!
- //! \brief Set the HW kernel size of the convolution.
- //!
- //! If executing this layer on DLA, both height and width of kernel size must be in the range [1,32].
- //!
- //! \see getKernelSize()
- //!
- //! \deprecated Superseded by setKernelSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setKernelSize(DimsHW kernelSize) noexcept
- {
- mImpl->setKernelSize(kernelSize);
- }
-
- //!
- //! \brief Get the HW kernel size of the convolution.
- //!
- //! \see setKernelSize()
- //!
- //! \deprecated Superseded by getKernelSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getKernelSize() const noexcept
- {
- return mImpl->getKernelSize();
- }
-
//!
//! \brief Set the number of output maps for the convolution.
//!
@@ -1107,7 +1023,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getNbOutputMaps()
//!
- void setNbOutputMaps(int32_t nbOutputMaps) noexcept
+ void setNbOutputMaps(int64_t nbOutputMaps) noexcept
{
mImpl->setNbOutputMaps(nbOutputMaps);
}
@@ -1117,69 +1033,11 @@ class IConvolutionLayer : public ILayer
//!
//! \see setNbOutputMaps()
//!
- int32_t getNbOutputMaps() const noexcept
+ int64_t getNbOutputMaps() const noexcept
{
return mImpl->getNbOutputMaps();
}
- //!
- //! \brief Get the stride of the convolution.
- //!
- //! Default: (1,1)
- //!
- //! If executing this layer on DLA, both height and width of stride must be in the range [1,8].
- //!
- //! \see getStride()
- //!
- //! \deprecated Superseded by setStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setStride(DimsHW stride) noexcept
- {
- mImpl->setStride(stride);
- }
-
- //!
- //! \brief Get the stride of the convolution.
- //!
- //! \deprecated Superseded by getStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getStride() const noexcept
- {
- return mImpl->getStride();
- }
-
- //!
- //! \brief Set the padding of the convolution.
- //!
- //! The input will be zero-padded by this number of elements in the height and width directions.
- //! Padding is symmetric.
- //!
- //! Default: (0,0)
- //!
- //! If executing this layer on DLA, both height and width of padding must be in the range [0,31],
- //! and the padding size must be less than the kernel size.
- //!
- //! \see getPadding()
- //!
- //! \deprecated Superseded by setPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPadding(DimsHW padding) noexcept
- {
- return mImpl->setPadding(padding);
- }
-
- //!
- //! \brief Get the padding of the convolution. If the padding is asymmetric, the pre-padding is returned.
- //!
- //! \see setPadding()
- //!
- //! \deprecated Superseded by getPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPadding() const noexcept
- {
- return mImpl->getPadding();
- }
-
//!
//! \brief Set the number of groups for a convolution.
//!
@@ -1195,7 +1053,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getNbGroups()
//!
- void setNbGroups(int32_t nbGroups) noexcept
+ void setNbGroups(int64_t nbGroups) noexcept
{
mImpl->setNbGroups(nbGroups);
}
@@ -1205,7 +1063,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see setNbGroups()
//!
- int32_t getNbGroups() const noexcept
+ int64_t getNbGroups() const noexcept
{
return mImpl->getNbGroups();
}
@@ -1259,34 +1117,6 @@ class IConvolutionLayer : public ILayer
return mImpl->getBiasWeights();
}
- //!
- //! \brief Set the dilation for a convolution.
- //!
- //! Default: (1,1)
- //!
- //! If executing this layer on DLA, both height and width must be in the range [1,32].
- //!
- //! \see getDilation()
- //!
- //! \deprecated Superseded by setDilationNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setDilation(DimsHW dilation) noexcept
- {
- return mImpl->setDilation(dilation);
- }
-
- //!
- //! \brief Get the dilation for a convolution.
- //!
- //! \see setDilation()
- //!
- //! \deprecated Superseded by getDilationNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getDilation() const noexcept
- {
- return mImpl->getDilation();
- }
-
//!
//! \brief Set the multi-dimension pre-padding of the convolution.
//!
@@ -1299,7 +1129,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getPrePadding()
//!
- void setPrePadding(Dims padding) noexcept
+ void setPrePadding(Dims const& padding) noexcept
{
mImpl->setPrePadding(padding);
}
@@ -1326,7 +1156,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getPostPadding()
//!
- void setPostPadding(Dims padding) noexcept
+ void setPostPadding(Dims const& padding) noexcept
{
mImpl->setPostPadding(padding);
}
@@ -1375,7 +1205,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getKernelSizeNd()
//!
- void setKernelSizeNd(Dims kernelSize) noexcept
+ void setKernelSizeNd(Dims const& kernelSize) noexcept
{
mImpl->setKernelSizeNd(kernelSize);
}
@@ -1398,9 +1228,9 @@ class IConvolutionLayer : public ILayer
//! If executing this layer on DLA, only support 2D stride, both height and width of stride must be in the range
//! [1,8].
//!
- //! \see getStrideNd() setStride() getStride()
+ //! \see getStrideNd()
//!
- void setStrideNd(Dims stride) noexcept
+ void setStrideNd(Dims const& stride) noexcept
{
mImpl->setStrideNd(stride);
}
@@ -1428,7 +1258,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getPaddingNd() setPadding() getPadding()
//!
- void setPaddingNd(Dims padding) noexcept
+ void setPaddingNd(Dims const& padding) noexcept
{
mImpl->setPaddingNd(padding);
}
@@ -1454,7 +1284,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getDilation()
//!
- void setDilationNd(Dims dilation) noexcept
+ void setDilationNd(Dims const& dilation) noexcept
{
mImpl->setDilationNd(dilation);
}
@@ -1480,6 +1310,7 @@ class IConvolutionLayer : public ILayer
//! Input 0 is the input activation tensor.
//! Input 1 is the kernel tensor. If used, the kernel weights parameter must be set to empty weights.
//! Input 2 is the bias tensor. If used, the bias parameter must be set to empty weights.
+ //!
//! \see getKernelWeights(), setKernelWeights(), getBiasWeights(), setBiasWeights()
//!
using ILayer::setInput;
@@ -1489,132 +1320,6 @@ class IConvolutionLayer : public ILayer
apiv::VConvolutionLayer* mImpl;
};
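A sketch of building a convolution with the remaining Nd setters (assuming `network` is an `INetworkDefinition*`, `x` an `ITensor*`, and `kernelWeights`/`biasWeights` are prepared `Weights`):

    // 3x3 convolution with 64 output maps; the removed HW setters are replaced by the Nd variants.
    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(
        *x, 64, nvinfer1::Dims{2, {3, 3}}, kernelWeights, biasWeights);
    conv->setStrideNd(nvinfer1::Dims{2, {1, 1}});
    conv->setPaddingNd(nvinfer1::Dims{2, {1, 1}});
    conv->setDilationNd(nvinfer1::Dims{2, {1, 1}});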
-//! \class IFullyConnectedLayer
-//!
-//! \brief A fully connected layer in a network definition.
-//! This layer expects an input tensor of three or more non-batch dimensions. The input is automatically
-//! reshaped into an `MxV` tensor `X`, where `V` is a product of the last three dimensions and `M`
-//! is a product of the remaining dimensions (where the product over 0 dimensions is defined as 1). For example:
-//!
-//! - If the input tensor has shape `{C, H, W}`, then the tensor is reshaped into `{1, C*H*W}`.
-//! - If the input tensor has shape `{P, C, H, W}`, then the tensor is reshaped into `{P, C*H*W}`.
-//!
-//! The layer then performs the following operation:
-//!
-//! ~~~
-//! Y := matmul(X, W^T) + bias
-//! ~~~
-//!
-//! Where `X` is the `MxV` tensor defined above, `W` is the `KxV` weight tensor
-//! of the layer, and `bias` is a row vector size `K` that is broadcasted to
-//! `MxK`. `K` is the number of output channels, and configurable via
-//! setNbOutputChannels(). If `bias` is not specified, it is implicitly `0`.
-//!
-//! The `MxK` result `Y` is then reshaped such that the last three dimensions are `{K, 1, 1}` and
-//! the remaining dimensions match the dimensions of the input tensor. For example:
-//!
-//! - If the input tensor has shape `{C, H, W}`, then the output tensor will have shape `{K, 1, 1}`.
-//! - If the input tensor has shape `{P, C, H, W}`, then the output tensor will have shape `{P, K, 1, 1}`.
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-//! \deprecated Deprecated in TensorRT 8.4. Superseded by IMatrixMultiplyLayer.
-//!
-class TRT_DEPRECATED IFullyConnectedLayer : public ILayer
-{
-public:
- //!
- //! \brief Set the number of output channels `K` from the fully connected layer.
- //!
- //! If executing this layer on DLA, number of output channels must in the range [1,8192].
- //!
- //! \see getNbOutputChannels()
- //!
- void setNbOutputChannels(int32_t nbOutputs) noexcept
- {
- mImpl->setNbOutputChannels(nbOutputs);
- }
-
- //!
- //! \brief Get the number of output channels `K` from the fully connected layer.
- //!
- //! \see setNbOutputChannels()
- //!
- int32_t getNbOutputChannels() const noexcept
- {
- return mImpl->getNbOutputChannels();
- }
-
- //!
- //! \brief Set the kernel weights, given as a `KxC` matrix in row-major order.
- //!
- //! \see getKernelWeights()
- //!
- void setKernelWeights(Weights weights) noexcept
- {
- mImpl->setKernelWeights(weights);
- }
-
- //!
- //! \brief Get the kernel weights.
- //!
- //! \see setKernelWeights()
- //!
- Weights getKernelWeights() const noexcept
- {
- return mImpl->getKernelWeights();
- }
-
- //!
- //! \brief Set the bias weights.
- //!
- //! Bias is optional. To omit bias, set the count value in the weights structure to zero.
- //!
- //! \see getBiasWeightsWeights()
- //!
- void setBiasWeights(Weights weights) noexcept
- {
- mImpl->setBiasWeights(weights);
- }
-
- //!
- //! \brief Get the bias weights.
- //!
- //! \see setBiasWeightsWeights()
- //!
- Weights getBiasWeights() const noexcept
- {
- return mImpl->getBiasWeights();
- }
-
- //!
- //! \brief Append or replace an input of this layer with a specific tensor
- //!
- //! \param index the index of the input to modify.
- //! \param tensor the new input tensor
- //!
- //! Only index 0 (data input) is valid, unless explicit-quantization mode is enabled.
- //! In explicit-quantization mode, input with index 1 is the kernel-weights tensor, if present.
- //! The kernel-weights tensor must be a build-time constant (computable at build-time via constant-folding)
- //! and an output of a dequantize layer.
- //! If input index 1 is used then the kernel-weights parameter must be set to empty Weights.
- //!
- //! \see getKernelWeights(), setKernelWeights()
- //!
- //! The indices are as follows:
- //!
- //! - 0: The input activation tensor.
- //! - 1: The kernel weights tensor (a constant tensor).
- //!
- //! If this function is called with the value 1, then the function getNbInputs() changes
- //! from returning 1 to 2.
- using ILayer::setInput;
-
-protected:
- virtual ~IFullyConnectedLayer() noexcept = default;
- apiv::VFullyConnectedLayer* mImpl;
-};
-
//!
//! \class IActivationLayer
//!
@@ -1712,9 +1417,9 @@ class IActivationLayer : public ILayer
//!
enum class PoolingType : int32_t
{
- kMAX = 0, // Maximum over elements
- kAVERAGE = 1, // Average over elements. If the tensor is padded, the count includes the padding
- kMAX_AVERAGE_BLEND = 2 // Blending between max and average pooling: (1-blendFactor)*maxPool + blendFactor*avgPool
+ kMAX = 0, //!< Maximum over elements
+ kAVERAGE = 1, //!< Average over elements. If the tensor is padded, the count includes the padding
+ kMAX_AVERAGE_BLEND = 2 //!< Blending between max and average pooling: (1-blendFactor)*maxPool + blendFactor*avgPool
};
namespace impl
@@ -1767,90 +1472,6 @@ class IPoolingLayer : public ILayer
return mImpl->getPoolingType();
}
- //!
- //! \brief Set the window size for pooling.
- //!
- //! If executing this layer on DLA, both height and width of window size must be in the range [1,8].
- //!
- //! \see getWindowSize()
- //!
- //! \deprecated Superseded by setWindowSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setWindowSize(DimsHW windowSize) noexcept
- {
- mImpl->setWindowSize(windowSize);
- }
-
- //!
- //! \brief Get the window size for pooling.
- //!
- //! \see setWindowSize()
- //!
- //! \deprecated Superseded by getWindowSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getWindowSize() const noexcept
- {
- return mImpl->getWindowSize();
- }
-
- //!
- //! \brief Set the stride for pooling.
- //!
- //! Default: 1
- //!
- //! If executing this layer on DLA, both height and width of stride must be in the range [1,16].
- //!
- //! \see getStride()
- //!
- //! \deprecated Superseded by setStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setStride(DimsHW stride) noexcept
- {
- mImpl->setStride(stride);
- }
-
- //!
- //! \brief Get the stride for pooling.
- //!
- //! \see setStride()
- //!
- //! \deprecated Superseded by getStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getStride() const noexcept
- {
- return mImpl->getStride();
- }
-
- //!
- //! \brief Set the padding for pooling.
- //!
- //! Default: 0
- //!
- //! If executing this layer on DLA, both height and width of padding must be in the range [0,7].
- //!
- //! \see getPadding()
- //!
- //! \deprecated Superseded by setPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPadding(DimsHW padding) noexcept
- {
- mImpl->setPadding(padding);
- }
-
- //!
- //! \brief Get the padding for pooling.
- //!
- //! Default: 0
- //!
- //! \see setPadding()
- //!
- //! \deprecated Superseded by getPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPadding() const noexcept
- {
- return mImpl->getPadding();
- }
-
//!
//! \brief Set the blending factor for the max_average_blend mode:
//! max_average_blendPool = (1-blendFactor)*maxPool + blendFactor*avgPool
@@ -1886,9 +1507,6 @@ class IPoolingLayer : public ILayer
//!
//! Default: true
//!
- //! \note On Xavier, DLA supports only inclusive padding and this must be explicitly
- //! set to false.
- //!
//! \see getAverageCountExcludesPadding()
//!
void setAverageCountExcludesPadding(bool exclusive) noexcept
@@ -1920,7 +1538,7 @@ class IPoolingLayer : public ILayer
//!
//! \see getPrePadding()
//!
- void setPrePadding(Dims padding) noexcept
+ void setPrePadding(Dims const& padding) noexcept
{
mImpl->setPrePadding(padding);
}
@@ -1948,7 +1566,7 @@ class IPoolingLayer : public ILayer
//!
//! \see getPostPadding()
//!
- void setPostPadding(Dims padding) noexcept
+ void setPostPadding(Dims const& padding) noexcept
{
mImpl->setPostPadding(padding);
}
@@ -1995,7 +1613,7 @@ class IPoolingLayer : public ILayer
//!
//! \see getWindowSizeNd() setWindowSize() getWindowSize()
//!
- void setWindowSizeNd(Dims windowSize) noexcept
+ void setWindowSizeNd(Dims const& windowSize) noexcept
{
mImpl->setWindowSizeNd(windowSize);
}
@@ -2018,9 +1636,9 @@ class IPoolingLayer : public ILayer
//! If executing this layer on DLA, only support 2D stride, both height and width of stride must be in the range
//! [1,16].
//!
- //! \see getStrideNd() setStride() getStride()
+ //! \see getStrideNd()
//!
- void setStrideNd(Dims stride) noexcept
+ void setStrideNd(Dims const& stride) noexcept
{
mImpl->setStrideNd(stride);
}
@@ -2049,7 +1667,7 @@ class IPoolingLayer : public ILayer
//!
//! \see getPaddingNd() setPadding() getPadding()
//!
- void setPaddingNd(Dims padding) noexcept
+ void setPaddingNd(Dims const& padding) noexcept
{
mImpl->setPaddingNd(padding);
}
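For illustration, a minimal pooling sketch using the Nd setters (assuming `network` is an `INetworkDefinition*` and `x` an `ITensor*`):

    // 2x2 max-average-blend pooling with stride 2.
    nvinfer1::IPoolingLayer* pool = network->addPoolingNd(
        *x, nvinfer1::PoolingType::kMAX_AVERAGE_BLEND, nvinfer1::Dims{2, {2, 2}});
    pool->setStrideNd(nvinfer1::Dims{2, {2, 2}});
    pool->setBlendFactor(0.5F); // 0.5 * maxPool + 0.5 * avgPool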
@@ -2092,7 +1710,7 @@ class ILRNLayer : public ILayer
//!
//! \see setWindowStride()
//!
- void setWindowSize(int32_t windowSize) noexcept
+ void setWindowSize(int64_t windowSize) noexcept
{
mImpl->setWindowSize(windowSize);
}
@@ -2102,7 +1720,7 @@ class ILRNLayer : public ILayer
//!
//! \see getWindowStride()
//!
- int32_t getWindowSize() const noexcept
+ int64_t getWindowSize() const noexcept
{
return mImpl->getWindowSize();
}
@@ -2111,6 +1729,7 @@ class ILRNLayer : public ILayer
//! \brief Set the LRN alpha value.
//!
//! The valid range is [-1e20, 1e20].
+ //!
//! \see getAlpha()
//!
void setAlpha(float alpha) noexcept
@@ -2132,6 +1751,7 @@ class ILRNLayer : public ILayer
//! \brief Set the LRN beta value.
//!
//! The valid range is [0.01, 1e5f].
+ //!
//! \see getBeta()
//!
void setBeta(float beta) noexcept
@@ -2153,6 +1773,7 @@ class ILRNLayer : public ILayer
//! \brief Set the LRN K value.
//!
//! The valid range is [1e-5, 1e10].
+ //!
//! \see getK()
//!
void setK(float k) noexcept
@@ -2214,8 +1835,7 @@ constexpr inline int32_t EnumMax() noexcept
//!
//! The output size is the same as the input size.
//!
-//! \note The input tensor for this layer is required to have a minimum of 3 dimensions in implicit batch mode
-//! and a minimum of 4 dimensions in explicit batch mode.
+//! \note The input tensor is required to have at least 4 dimensions.
//!
//! A scale layer may be used as an INT8 quantization node in a graph, if the output is constrained to INT8 and
//! the input to FP32. Quantization rounds ties to even, and clamps to [-128, 127].
@@ -2357,8 +1977,7 @@ class IScaleLayer : public ILayer
//!
//! The output size is the same as the input size.
//!
-//! On Xavier, this layer is not supported on DLA.
-//! Otherwise, the following constraints must be satisfied to execute this layer on DLA:
+//! The following constraints must be satisfied to execute this layer on DLA:
//! * Axis must be one of the channel or spatial dimensions.
//! * There are two classes of supported input sizes:
//! 1. Non-axis, non-batch dimensions are all 1 and the axis dimension is at most 8192.
@@ -2376,17 +1995,8 @@ class ISoftMaxLayer : public ILayer
//! \brief Set the axis along which softmax is computed. Currently, only one axis can be set.
//!
//! The axis is specified by setting the bit corresponding to the axis to 1.
- //! For example, consider an NCHW tensor as input (three non-batch dimensions).
- //!
- //! In implicit mode :
- //! Bit 0 corresponds to the C dimension boolean.
- //! Bit 1 corresponds to the H dimension boolean.
- //! Bit 2 corresponds to the W dimension boolean.
- //! By default, softmax is performed on the axis which is the number of axes minus three. It is 0 if
- //! there are fewer than 3 non-batch axes. For example, if the input is NCHW, the default axis is C. If the input
- //! is NHW, then the default axis is H.
+ //! For example, consider an NCHW tensor as input.
//!
- //! In explicit mode :
//! Bit 0 corresponds to the N dimension boolean.
//! Bit 1 corresponds to the C dimension boolean.
//! Bit 2 corresponds to the H dimension boolean.
@@ -2395,8 +2005,7 @@ class ISoftMaxLayer : public ILayer
//! there are fewer than 3 axes. For example, if the input is NCHW, the default axis is C. If the input
//! is NHW, then the default axis is N.
//!
- //! For example, to perform softmax on axis R of a NPQRCHW input, set bit 2 with implicit batch mode,
- //! set bit 3 with explicit batch mode.
+ //! For example, to perform softmax on axis R of a NPQRCHW input, set bit 3.
//!
//! \param axes The axis along which softmax is computed.
//! Here axes is a bitmap. For example, when doing softmax along axis 0, bit 0 is set to 1, axes = 1 << axis
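A short sketch of the bitmask convention (assuming `network` is an `INetworkDefinition*` and `x` is an NCHW `ITensor*`):

    // Softmax over the C dimension of an NCHW input: C is axis 1, so set bit 1.
    nvinfer1::ISoftMaxLayer* sm = network->addSoftMax(*x);
    sm->setAxes(1U << 1);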
@@ -2442,7 +2051,6 @@ class IConcatenationLayer : public ILayer
//!
//! The default axis is the number of tensor dimensions minus three, or zero if the tensor has fewer than three
//! dimensions. For example, for a tensor with dimensions NCHW, it is C.
- //! For implicit batch mode, the number of tensor dimensions does NOT include the implicit batch dimension.
//!
//! When running this layer on the DLA, the concatenation axis must be the third to last axis, e.g. C if tensor
//! dimensions are NCHW.
@@ -2480,41 +2088,13 @@ class IDeconvolutionLayer : public ILayer
{
public:
//!
- //! \brief Set the HW kernel size of the convolution.
- //!
- //! If executing this layer on DLA, both height and width of kernel size must be in the range [1,32], or the
- //! combinations of [64, 96, 128] in one dimension and 1 in the other dimensions, i.e. [1x64] or [64x1] are valid,
- //! but not [64x64].
- //!
- //! \see getKernelSize()
- //!
- //! \deprecated Superseded by setKernelSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setKernelSize(DimsHW kernelSize) noexcept
- {
- mImpl->setKernelSize(kernelSize);
- }
-
- //!
- //! \brief Get the HW kernel size of the deconvolution.
- //!
- //! \see setKernelSize()
- //!
- //! \deprecated Superseded by getKernelSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getKernelSize() const noexcept
- {
- return mImpl->getKernelSize();
- }
-
- //!
- //! \brief Set the number of output feature maps for the deconvolution.
+ //! \brief Set the number of output feature maps for the deconvolution.
//!
//! If executing this layer on DLA, the number of output maps must be in the range [1,8192].
//!
//! \see getNbOutputMaps()
//!
- void setNbOutputMaps(int32_t nbOutputMaps) noexcept
+ void setNbOutputMaps(int64_t nbOutputMaps) noexcept
{
mImpl->setNbOutputMaps(nbOutputMaps);
}
@@ -2524,73 +2104,11 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see setNbOutputMaps()
//!
- int32_t getNbOutputMaps() const noexcept
+ int64_t getNbOutputMaps() const noexcept
{
return mImpl->getNbOutputMaps();
}
- //!
- //! \brief Set the stride of the deconvolution.
- //!
- //! If executing this layer on DLA, there is one restriction:
- //! 1) Stride height and width must be in the range [1,32] or the combinations of [64, 96, 128] in one
- //! dimension and 1 in the other dimensions, i.e. [1x64] or [64x1] are valid, but not [64x64].
- //!
- //! \see getStride()
- //!
- //! \deprecated Superseded by setStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setStride(DimsHW stride) noexcept
- {
- mImpl->setStride(stride);
- }
-
- //!
- //! \brief Get the stride of the deconvolution.
- //!
- //! Default: (1,1)
- //!
- //! \deprecated Superseded by getStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getStride() const noexcept
- {
- return mImpl->getStride();
- }
-
- //!
- //! \brief Set the padding of the deconvolution.
- //!
- //! The output will be trimmed by this number of elements on each side in the height and width directions.
- //! In other words, it resembles the inverse of a convolution layer with this padding size.
- //! Padding is symmetric, and negative padding is not supported.
- //!
- //! Default: (0,0)
- //!
- //! If executing this layer on DLA, both height and width of padding must be 0.
- //!
- //! \see getPadding()
- //!
- //! \deprecated Superseded by setPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPadding(DimsHW padding) noexcept
- {
- mImpl->setPadding(padding);
- }
-
- //!
- //! \brief Get the padding of the deconvolution.
- //!
- //! Default: (0, 0)
- //!
- //! \see setPadding()
- //!
- //! \deprecated Superseded by getPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPadding() const noexcept
- {
- return mImpl->getPadding();
- }
-
//!
//! \brief Set the number of groups for a deconvolution.
//!
@@ -2606,7 +2124,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see getNbGroups()
//!
- void setNbGroups(int32_t nbGroups) noexcept
+ void setNbGroups(int64_t nbGroups) noexcept
{
mImpl->setNbGroups(nbGroups);
}
@@ -2616,7 +2134,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see setNbGroups()
//!
- int32_t getNbGroups() const noexcept
+ int64_t getNbGroups() const noexcept
{
return mImpl->getNbGroups();
}
@@ -2683,7 +2201,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see getPrePadding()
//!
- void setPrePadding(Dims padding) noexcept
+ void setPrePadding(Dims const& padding) noexcept
{
mImpl->setPrePadding(padding);
}
@@ -2711,7 +2229,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see getPostPadding()
//!
- void setPostPadding(Dims padding) noexcept
+ void setPostPadding(Dims const& padding) noexcept
{
mImpl->setPostPadding(padding);
}
@@ -2760,9 +2278,9 @@ class IDeconvolutionLayer : public ILayer
//! 2) Kernel height and width must be in the range [1,32] or the combinations of [64, 96, 128] in one
//! dimension and 1 in the other dimensions, i.e. [1x64] or [64x1] are valid, but not [64x64].
//!
- //! \see getKernelSizeNd() setKernelSize() getKernelSize()
+ //! \see getKernelSizeNd()
//!
- void setKernelSizeNd(Dims kernelSize) noexcept
+ void setKernelSizeNd(Dims const& kernelSize) noexcept
{
mImpl->setKernelSizeNd(kernelSize);
}
@@ -2787,9 +2305,9 @@ class IDeconvolutionLayer : public ILayer
//! 2) Stride height and width must be in the range [1,32] or the combinations of [64, 96, 128] in one
//! dimension and 1 in the other dimensions, i.e. [1x64] or [64x1] are valid, but not [64x64].
//!
- //! \see getStrideNd() setStride() getStride()
+ //! \see getStrideNd()
//!
- void setStrideNd(Dims stride) noexcept
+ void setStrideNd(Dims const& stride) noexcept
{
mImpl->setStrideNd(stride);
}
@@ -2817,7 +2335,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see getPaddingNd() setPadding() getPadding()
//!
- void setPaddingNd(Dims padding) noexcept
+ void setPaddingNd(Dims const& padding) noexcept
{
mImpl->setPaddingNd(padding);
}
@@ -2843,17 +2361,19 @@ class IDeconvolutionLayer : public ILayer
//! Input 0 is the input activation tensor.
//! Input 1 is the kernel tensor. If used, the kernel weights parameter must be set to empty weights.
//! Input 2 is the bias tensor. If used, the bias parameter must be set to empty weights.
+ //!
//! \see getKernelWeights(), setKernelWeights(), getBiasWeights(), setBiasWeights()
//!
using ILayer::setInput;
+ //!
//! \brief Set the multi-dimension dilation of the deconvolution.
//!
//! Default: (1, 1, ..., 1)
//!
//! \see getDilationNd()
//!
- void setDilationNd(Dims dilation) noexcept
+ void setDilationNd(Dims const& dilation) noexcept
{
mImpl->setDilationNd(dilation);
}
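A sketch of a transposed convolution with the Nd setters (assuming `network`, `x`, `kernelWeights`, and `biasWeights` as in the convolution sketch earlier):

    // 4x4 deconvolution with 32 output maps, stride 2, padding 1.
    nvinfer1::IDeconvolutionLayer* deconv = network->addDeconvolutionNd(
        *x, 32, nvinfer1::Dims{2, {4, 4}}, kernelWeights, biasWeights);
    deconv->setStrideNd(nvinfer1::Dims{2, {2, 2}});
    deconv->setPaddingNd(nvinfer1::Dims{2, {1, 1}});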
@@ -2880,9 +2400,10 @@ class IDeconvolutionLayer : public ILayer
//!
//! Operations kAND, kOR, and kXOR must have inputs of DataType::kBOOL.
//!
-//! Operation kPOW must have inputs of DataType::kFLOAT, DataType::kHALF, or DataType::kINT8.
+//! Operation kPOW must have inputs of floating-point type or DataType::kINT8.
//!
-//! All other operations must have inputs of DataType::kFLOAT, DataType::kHALF, DataType::kINT8, or DataType::kINT32.
+//! All other operations must have inputs of floating-point type, DataType::kINT8, DataType::kINT32, or
+//! DataType::kINT64.
//!
//! \see IElementWiseLayer
//!
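For example, a minimal sketch (assuming `network` is an `INetworkDefinition*` and `a`, `b` are `ITensor*` of a floating-point type):

    // kPOW requires floating-point or INT8 inputs per the constraints above.
    nvinfer1::IElementWiseLayer* pw = network->addElementWise(*a, *b, nvinfer1::ElementWiseOperation::kPOW);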
@@ -3035,7 +2556,7 @@ constexpr inline int32_t EnumMax() noexcept
//! GatherMode::kELEMENT:
//! The output dimensions match the dimensions of the indices tensor.
//!
-//! The types of Data and Output must be the same, and Indices shall be DataType::kINT32.
+//! The types of Data and Output must be the same, and Indices shall be DataType::kINT32 or DataType::kINT64.
//!
//! How the elements of Data are gathered depends on the mode:
//!
@@ -3065,7 +2586,6 @@ constexpr inline int32_t EnumMax() noexcept
//! Notes:
//! * For modes GatherMode::kND and GatherMode::kELEMENT, the first nbElementWiseDims dimensions of data and index must
//! be equal. If not, an error will be reported at build time or run time.
-//! * Only mode GatherMode::kDEFAULT supports an implicit batch dimensions or broadcast on the elementwise dimensions.
//! * If an axis of Data has dynamic length, using a negative index for it has undefined behavior.
//! * No DLA support
//! * Zero will be stored for OOB access
@@ -3091,6 +2611,7 @@ class IGatherLayer : public ILayer
//!
//! \brief Get the axis to gather on.
+ //!
//! \warning Undefined behavior when used with GatherMode::kND.
//!
//! \see setGatherAxis()
@@ -3100,17 +2621,19 @@ class IGatherLayer : public ILayer
return mImpl->getGatherAxis();
}
+ //!
//! \brief Set the number of leading dimensions of indices tensor to be handled elementwise.
+ //!
//! The gathering of indexing starts from the dimension of data[NbElementWiseDims:].
//! The NbElementWiseDims must be less than the Rank of the data input.
+ //!
//! \param elementWiseDims number of dims to be handled as elementwise.
//!
//! Default: 0
//!
//! The value of nbElementWiseDims and GatherMode are checked during network validation:
//!
- //! GatherMode::kDEFAULT: nbElementWiseDims must be 0 if there is an implicit batch dimension. It can be 0 or 1 if
- //! there is not an implicit batch dimension.
+ //! GatherMode::kDEFAULT: nbElementWiseDims can be 0 or 1.
//! GatherMode::kND: nbElementWiseDims can be between 0 and one less than rank(data).
//! GatherMode::kELEMENT: nbElementWiseDims must be 0
//!
@@ -3157,506 +2680,57 @@ class IGatherLayer : public ILayer
};
//!
-//! \enum RNNOperation
-//!
-//! \brief Enumerates the RNN operations that may be performed by an RNN layer.
-//!
-//! __Equation definitions__
-//!
-//! The equations below have the following naming convention:
-//!
-//! ~~~
-//! t := current time step
-//!
-//! i := input gate
-//! o := output gate
-//! f := forget gate
-//! z := update gate
-//! r := reset gate
-//! c := cell gate
-//! h := hidden gate
-//!
-//! g[t] denotes the output of gate g at timestep t, e.g.
-//! f[t] is the output of the forget gate f.
-//!
-//! X[t] := input tensor for timestep t
-//! C[t] := cell state for timestep t
-//! H[t] := hidden state for timestep t
-//!
-//! W[g] := W (input) parameter weight matrix for gate g
-//! R[g] := U (recurrent) parameter weight matrix for gate g
-//! Wb[g] := W (input) parameter bias vector for gate g
-//! Rb[g] := U (recurrent) parameter bias vector for gate g
-//!
-//! Unless otherwise specified, all operations apply pointwise
-//! to elements of each operand tensor.
-//!
-//! ReLU(X) := max(X, 0)
-//! tanh(X) := hyperbolic tangent of X
-//! sigmoid(X) := 1 / (1 + exp(-X))
-//! exp(X) := e^X
-//!
-//! A.B denotes matrix multiplication of A and B.
-//! A*B denotes pointwise multiplication of A and B.
-//! ~~~
-//!
-//! __Equations__
-//!
-//! Depending on the value of RNNOperation chosen, each sub-layer of the RNN
-//! layer will perform one of the following operations:
-//!
-//! ~~~
-//! ::kRELU
-//!
-//! H[t] := ReLU(W[i].X[t] + R[i].H[t-1] + Wb[i] + Rb[i])
-//!
-//! ::kTANH
-//!
-//! H[t] := tanh(W[i].X[t] + R[i].H[t-1] + Wb[i] + Rb[i])
-//!
-//! ::kLSTM
-//!
-//! i[t] := sigmoid(W[i].X[t] + R[i].H[t-1] + Wb[i] + Rb[i])
-//! f[t] := sigmoid(W[f].X[t] + R[f].H[t-1] + Wb[f] + Rb[f])
-//! o[t] := sigmoid(W[o].X[t] + R[o].H[t-1] + Wb[o] + Rb[o])
-//! c[t] := tanh(W[c].X[t] + R[c].H[t-1] + Wb[c] + Rb[c])
-//!
-//! C[t] := f[t]*C[t-1] + i[t]*c[t]
-//! H[t] := o[t]*tanh(C[t])
-//!
-//! ::kGRU
-//!
-//! z[t] := sigmoid(W[z].X[t] + R[z].H[t-1] + Wb[z] + Rb[z])
-//! r[t] := sigmoid(W[r].X[t] + R[r].H[t-1] + Wb[r] + Rb[r])
-//! h[t] := tanh(W[h].X[t] + r[t]*(R[h].H[t-1] + Rb[h]) + Wb[h])
-//!
-//! H[t] := (1 - z[t])*h[t] + z[t]*H[t-1]
-//! ~~~
-//!
-//! \see IRNNv2Layer
-//!
-enum class RNNOperation : int32_t
-{
- kRELU = 0, //!< Single gate RNN w/ ReLU activation function.
- kTANH = 1, //!< Single gate RNN w/ TANH activation function.
- kLSTM = 2, //!< Four-gate LSTM network w/o peephole connections.
- kGRU = 3 //!< Three-gate network consisting of Gated Recurrent Units.
-};
-
-//!
-//! Maximum number of elements in RNNOperation enum.
-//!
-//! \see RNNOperation
-//!
-template <>
-constexpr inline int32_t EnumMax() noexcept
-{
- return 4;
-}
-
-//!
-//! \enum RNNDirection
-//!
-//! \brief Enumerates the RNN direction that may be performed by an RNN layer.
-//!
-//! \see IRNNv2Layer
-//!
-enum class RNNDirection : int32_t
-{
- kUNIDIRECTION = 0, //!< Network iterations from first input to last input.
- kBIDIRECTION = 1 //!< Network iterates from first to last and vice versa and outputs concatenated.
-};
-
-//!
-//! Maximum number of elements in RNNDirection enum.
-//!
-//! \see RNNDirection
-//!
-template <>
-constexpr inline int32_t EnumMax() noexcept
-{
- return 2;
-}
-
-//!
-//! \enum RNNInputMode
-//!
-//! \brief Enumerates the RNN input modes that may occur with an RNN layer.
-//!
-//! If the RNN is configured with RNNInputMode::kLINEAR, then for each gate `g` in the first layer of the RNN,
-//! the input vector `X[t]` (length `E`) is left-multiplied by the gate's corresponding weight matrix `W[g]`
-//! (dimensions `HxE`) as usual, before being used to compute the gate output as described by \ref RNNOperation.
-//!
-//! If the RNN is configured with RNNInputMode::kSKIP, then this initial matrix multiplication is "skipped"
-//! and `W[g]` is conceptually an identity matrix. In this case, the input vector `X[t]` must have length `H`
-//! (the size of the hidden state).
-//!
-//! \see IRNNv2Layer
-//!
-enum class RNNInputMode : int32_t
-{
- kLINEAR = 0, //!< Perform the normal matrix multiplication in the first recurrent layer.
- kSKIP = 1 //!< No operation is performed on the first recurrent layer.
-};
-
-//!
-//! Maximum number of elements in RNNInputMode enum.
-//!
-//! \see RNNInputMode
-//!
-template <>
-constexpr inline int32_t EnumMax() noexcept
-{
- return 2;
-}
-
-//!
-//! \enum RNNGateType
-//!
-//! \brief Identifies an individual gate within an RNN cell.
-//!
-//! \see RNNOperation
-//!
-enum class RNNGateType : int32_t
-{
- kINPUT = 0, //!< Input gate (i).
- kOUTPUT = 1, //!< Output gate (o).
- kFORGET = 2, //!< Forget gate (f).
- kUPDATE = 3, //!< Update gate (z).
- kRESET = 4, //!< Reset gate (r).
- kCELL = 5, //!< Cell gate (c).
- kHIDDEN = 6 //!< Hidden gate (h).
-};
-
-//!
-//! Maximum number of elements in RNNGateType enum.
-//!
-//! \see RNNGateType
-//!
-template <>
-constexpr inline int32_t EnumMax() noexcept
-{
- return 7;
-}
-
-//!
-//! \class IRNNv2Layer
-//!
-//! \brief An RNN layer in a network definition, version 2.
+//! \class IPluginV2Layer
//!
-//! This layer supersedes IRNNLayer.
+//! \brief Layer type for pluginV2
//!
-//! \deprecated Deprecated prior to TensorRT 8.0 and will be removed in 9.0. Superseded by
-//! INetworkDefinition::addLoop().
+//! \see IPluginV2
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
-class TRT_DEPRECATED IRNNv2Layer : public ILayer
+class IPluginV2Layer : public ILayer
{
public:
- int32_t getLayerCount() const noexcept
- {
- return mImpl->getLayerCount();
- } //!< Get the layer count of the RNN.
- int32_t getHiddenSize() const noexcept
- {
- return mImpl->getHiddenSize();
- } //!< Get the hidden size of the RNN.
- int32_t getMaxSeqLength() const noexcept
- {
- return mImpl->getMaxSeqLength();
- } //!< Get the maximum sequence length of the RNN.
- int32_t getDataLength() const noexcept
- {
- return mImpl->getDataLength();
- } //!< Get the embedding length of the RNN.
-
- //!
- //! \brief Specify individual sequence lengths in the batch with the ITensor pointed to by
- //! \p seqLengths.
- //!
- //! The \p seqLengths ITensor should be a {N1, ..., Np} tensor, where N1..Np are the index dimensions
- //! of the input tensor to the RNN.
- //!
- //! If this is not specified, then the RNN layer assumes all sequences are size getMaxSeqLength().
- //!
- //! All sequence lengths in \p seqLengths should be in the range [1, getMaxSeqLength()]. Zero-length
- //! sequences are not supported.
- //!
- //! This tensor must be of type DataType::kINT32.
- //!
- void setSequenceLengths(ITensor& seqLengths) noexcept
- {
- return mImpl->setSequenceLengths(seqLengths);
- }
-
- //!
- //! \brief Get the sequence lengths specified for the RNN.
- //!
- //! \return nullptr if no sequence lengths were specified, the sequence length data otherwise.
- //!
- //! \see setSequenceLengths()
- //!
- ITensor* getSequenceLengths() const noexcept
- {
- return mImpl->getSequenceLengths();
- }
-
- //!
- //! \brief Set the operation of the RNN layer.
- //!
- //! \see getOperation(), RNNOperation
- //!
- void setOperation(RNNOperation op) noexcept
- {
- mImpl->setOperation(op);
- }
-
- //!
- //! \brief Get the operation of the RNN layer.
- //!
- //! \see setOperation(), RNNOperation
- //!
- RNNOperation getOperation() const noexcept
- {
- return mImpl->getOperation();
- }
-
- //!
- //! \brief Set the input mode of the RNN layer.
- //!
- //! \see getInputMode(), RNNInputMode
- //!
- void setInputMode(RNNInputMode op) noexcept
- {
- mImpl->setInputMode(op);
- }
-
- //!
- //! \brief Get the input mode of the RNN layer.
- //!
- //! \see setInputMode(), RNNInputMode
- //!
- RNNInputMode getInputMode() const noexcept
- {
- return mImpl->getInputMode();
- }
-
- //!
- //! \brief Set the direction of the RNN layer.
- //!
- //! The direction determines if the RNN is run as a unidirectional(left to right) or
- //! bidirectional(left to right and right to left).
- //! In the RNNDirection::kBIDIRECTION case the output is concatenated together, resulting
- //! in output size of 2x getHiddenSize().
//!
- //! \see getDirection(), RNNDirection
- //!
- void setDirection(RNNDirection op) noexcept
- {
- mImpl->setDirection(op);
- }
-
- //!
- //! \brief Get the direction of the RNN layer.
- //!
- //! \see setDirection(), RNNDirection
- //!
- RNNDirection getDirection() const noexcept
- {
- return mImpl->getDirection();
- }
-
- //!
- //! \brief Set the weight parameters for an individual gate in the RNN.
- //!
- //! The DataType for this structure must be DataType::kFLOAT or DataType::kHALF, and must be the same
- //! datatype as the input tensor.
- //!
- //! Each parameter matrix is row-major in memory, and has the following dimensions:
- //!
- //! ~~~
- //! Let K := { ::kUNIDIRECTION => 1
- //! { ::kBIDIRECTION => 2
- //! l := layer index (as described above)
- //! H := getHiddenSize()
- //! E := getDataLength() (the embedding length)
- //! isW := true if the matrix is an input (W) matrix, and false if
- //! the matrix is a recurrent input (R) matrix.
- //!
- //! if isW:
- //! if l < K and ::kSKIP:
- //! (numRows, numCols) := (0, 0) # input matrix is skipped
- //! elif l < K and ::kLINEAR:
- //! (numRows, numCols) := (H, E) # input matrix acts on input data size E
- //! elif l >= K:
- //! (numRows, numCols) := (H, K * H) # input matrix acts on previous hidden state
- //! else: # not isW
- //! (numRows, numCols) := (H, H)
- //! ~~~
- //!
- //! In other words, the input weights of the first layer of the RNN (if
- //! not skipped) transform a `getDataLength()`-size column
- //! vector into a `getHiddenSize()`-size column vector. The input
- //! weights of subsequent layers transform a `K*getHiddenSize()`-size
- //! column vector into a `getHiddenSize()`-size column vector. `K=2` in
- //! the bidirectional case to account for the full hidden state being
- //! the concatenation of the forward and backward RNN hidden states.
- //!
- //! The recurrent weight matrices for all layers all have shape `(H, H)`,
- //! both in the unidirectional and bidirectional cases. (In the
- //! bidirectional case, each recurrent weight matrix for the (forward or
- //! backward) RNN cell operates on the previous (forward or
- //! backward) RNN cell's hidden state, which is size `H`).
- //!
- //! \param layerIndex The index of the layer that contains this gate.
- //! \param gate The name of the gate within the RNN layer. The gate name must correspond
- //! to one of the gates used by this layer's #RNNOperation.
- //! \param isW True if the weight parameters are for the input matrix W[g]
- //! and false if they are for the recurrent input matrix R[g]. See
- //! #RNNOperation for equations showing how these matrices are used
- //! in the RNN gate.
- //! \param weights The weight structure holding the weight parameters, which are stored
- //! as a row-major 2D matrix. See See \ref setWeightsForGate() for documentation on the expected
- //! dimensions of this matrix.
- //!
- void setWeightsForGate(int32_t layerIndex, RNNGateType gate, bool isW, Weights weights) noexcept
- {
- mImpl->setWeightsForGate(layerIndex, gate, isW, weights);
- }
-
- //!
- //! \brief Get the weight parameters for an individual gate in the RNN.
- //!
- //! \see setWeightsForGate()
- //!
- Weights getWeightsForGate(int32_t layerIndex, RNNGateType gate, bool isW) const noexcept
- {
- return mImpl->getWeightsForGate(layerIndex, gate, isW);
- }
-
- //!
- //! \brief Set the bias parameters for an individual gate in the RNN.
- //!
- //! The DataType for this structure must be DataType::kFLOAT or DataType::kHALF, and must be the same
- //! datatype as the input tensor.
- //!
- //! Each bias vector has a fixed size, getHiddenSize().
- //!
- //! \param layerIndex The index of the layer that contains this gate. See \ref setWeightsForGate()
- //! for a description of the layer index.
- //! \param gate The name of the gate within the RNN layer. The gate name must correspond
- //! to one of the gates used by this layer's #RNNOperation.
- //! \param isW True if the bias parameters are for the input bias Wb[g]
- //! and false if they are for the recurrent input bias Rb[g]. See
- //! #RNNOperation for equations showing how these bias vectors are used
- //! in the RNN gate.
- //! \param bias The weight structure holding the bias parameters, which should be an
- //! array of size getHiddenSize().
- //!
- void setBiasForGate(int32_t layerIndex, RNNGateType gate, bool isW, Weights bias) noexcept
- {
- mImpl->setBiasForGate(layerIndex, gate, isW, bias);
- }
-
- //!
- //! \brief Get the bias parameters for an individual gate in the RNN.
- //!
- //! \see setBiasForGate()
- //!
- Weights getBiasForGate(int32_t layerIndex, RNNGateType gate, bool isW) const noexcept
- {
- return mImpl->getBiasForGate(layerIndex, gate, isW);
- }
-
- //!
- //! \brief Set the initial hidden state of the RNN with the provided \p hidden ITensor.
- //!
- //! The \p hidden ITensor should have the dimensions `{N1, ..., Np, L, H}`, where:
- //!
- //! - `N1..Np` are the index dimensions specified by the input tensor
- //! - `L` is the number of layers in the RNN, equal to getLayerCount() if getDirection is
- //! RNNDirection::kUNIDIRECTION,
- //! and 2x getLayerCount() if getDirection is RNNDirection::kBIDIRECTION. In the bi-directional
- //! case, layer `l`'s final forward hidden state is stored in `L = 2*l`, and
- //! final backward hidden state is stored in `L= 2*l + 1`.
- //! - `H` is the hidden state for each layer, equal to getHiddenSize().
- //!
- void setHiddenState(ITensor& hidden) noexcept
- {
- mImpl->setHiddenState(hidden);
- }
-
- //!
- //! \brief Get the initial hidden state of the RNN.
- //!
- //! \see setHiddenState()
- //!
- ITensor* getHiddenState() const noexcept
- {
- return mImpl->getHiddenState();
- }
-
- //!
- //! \brief Set the initial cell state of the LSTM with the provided \p cell ITensor.
- //!
- //! The \p cell ITensor should have the dimensions `{N1, ..., Np, L, H}`, where:
- //!
- //! - `N1..Np` are the index dimensions specified by the input tensor
- //! - `L` is the number of layers in the RNN, equal to getLayerCount() if getDirection is
- //! RNNDirection::kUNIDIRECTION,
- //! and 2x getLayerCount() if getDirection is RNNDirection::kBIDIRECTION. In the bi-directional
- //! case, layer `l`'s final forward hidden state is stored in `L = 2*l`, and
- //! final backward hidden state is stored in `L= 2*l + 1`.
- //! - `H` is the hidden state for each layer, equal to getHiddenSize().
- //!
- //! It is an error to call setCellState() on an RNN layer that is not configured with RNNOperation::kLSTM.
- //!
- void setCellState(ITensor& cell) noexcept
- {
- mImpl->setCellState(cell);
- }
-
- //!
- //! \brief Get the initial cell state of the RNN.
+ //! \brief Get the plugin for the layer.
//!
- //! \see setCellState()
+ //! \see IPluginV2
//!
- ITensor* getCellState() const noexcept
+ IPluginV2& getPlugin() noexcept
{
- return mImpl->getCellState();
+ return mImpl->getPlugin();
}
protected:
- apiv::VRNNv2Layer* mImpl;
- virtual ~IRNNv2Layer() noexcept = default;
+ apiv::VPluginV2Layer* mImpl;
+ virtual ~IPluginV2Layer() noexcept = default;
};
//!
-//! \class IPluginV2Layer
+//! \class IPluginV3Layer
//!
-//! \brief Layer type for pluginV2
+//! \brief Layer type for V3 plugins
//!
-//! \see IPluginV2
+//! \see IPluginV3
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
-class IPluginV2Layer : public ILayer
+class IPluginV3Layer : public ILayer
{
public:
//!
//! \brief Get the plugin for the layer.
//!
- //! \see IPluginV2
+ //! \see IPluginV3
//!
- IPluginV2& getPlugin() noexcept
+ IPluginV3& getPlugin() noexcept
{
return mImpl->getPlugin();
}
protected:
- apiv::VPluginV2Layer* mImpl;
- virtual ~IPluginV2Layer() noexcept = default;
+ apiv::VPluginV3Layer* mImpl;
+ virtual ~IPluginV3Layer() noexcept = default;
};
//!
@@ -3666,13 +2740,12 @@ class IPluginV2Layer : public ILayer
//!
 //! Operation kNOT must have inputs of DataType::kBOOL.
//!
-//! Operation kSIGN must have inputs of DataType::kFLOAT, DataType::kHALF, DataType::kINT8, or DataType::kINT32.
-//!
-//! Operation kISINF must have inputs of DataType::kFLOAT or DataType::kHALF.
+//! Operations kSIGN and kABS must have inputs of floating-point type, DataType::kINT8, DataType::kINT32, or
+//! DataType::kINT64.
//!
-//! All other operations must have inputs of DataType::kFLOAT, DataType::kHALF, or DataType::kINT8.
+//! Operation kISINF must have inputs of floating-point type.
//!
-//! Operations kSIGN and kROUND are not supported in implicit batch mode.
+//! All other operations must have inputs of floating-point type.
//!
//! \see IUnaryLayer
//!
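// A minimal usage sketch of a unary layer, assuming `network` (INetworkDefinition) and `input` (ITensor)
// already exist and NvInfer.h is included; error handling is omitted for brevity.
inline nvinfer1::ITensor* addAbs(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    // kABS accepts floating-point, kINT8, kINT32, or kINT64 inputs, as described above.
    nvinfer1::IUnaryLayer* abs = network.addUnary(input, nvinfer1::UnaryOperation::kABS);
    return abs->getOutput(0);
}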
@@ -3878,58 +2951,6 @@ class IReduceLayer : public ILayer
class IPaddingLayer : public ILayer
{
public:
- //!
- //! \brief Set the padding that is applied at the start of the tensor.
- //!
- //! Negative padding results in trimming the edge by the specified amount
- //!
- //! \see getPrePadding
- //!
- //! \deprecated Superseded by setPrePaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPrePadding(DimsHW padding) noexcept
- {
- mImpl->setPrePadding(padding);
- }
-
- //!
- //! \brief Get the padding that is applied at the start of the tensor.
- //!
- //! \see setPrePadding
- //!
- //! \deprecated Superseded by getPrePaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPrePadding() const noexcept
- {
- return mImpl->getPrePadding();
- }
-
- //!
- //! \brief Set the padding that is applied at the end of the tensor.
- //!
- //! Negative padding results in trimming the edge by the specified amount
- //!
- //! \see getPostPadding
- //!
- //! \deprecated Superseded by setPostPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPostPadding(DimsHW padding) noexcept
- {
- mImpl->setPostPadding(padding);
- }
-
- //!
- //! \brief Get the padding that is applied at the end of the tensor.
- //!
- //! \see setPostPadding
- //!
- //! \deprecated Superseded by getPostPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPostPadding() const noexcept
- {
- return mImpl->getPostPadding();
- }
-
//!
//! \brief Set the padding that is applied at the start of the tensor.
//!
@@ -3939,7 +2960,7 @@ class IPaddingLayer : public ILayer
//!
//! \see getPrePaddingNd
//!
- void setPrePaddingNd(Dims padding) noexcept
+ void setPrePaddingNd(Dims const& padding) noexcept
{
mImpl->setPrePaddingNd(padding);
}
@@ -3965,7 +2986,7 @@ class IPaddingLayer : public ILayer
//!
//! \see getPostPaddingNd
//!
- void setPostPaddingNd(Dims padding) noexcept
+ void setPostPaddingNd(Dims const& padding) noexcept
{
mImpl->setPostPaddingNd(padding);
}
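// A minimal usage sketch of N-dimensional padding via addPaddingNd, assuming `network` and `input`
// already exist; negative values would trim the corresponding edge instead of padding it.
inline nvinfer1::ITensor* padHW(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    // Pad one element before and after each of the two innermost (H, W) dimensions.
    nvinfer1::IPaddingLayer* pad = network.addPaddingNd(input, nvinfer1::Dims2{1, 1}, nvinfer1::Dims2{1, 1});
    return pad->getOutput(0);
}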
@@ -3987,6 +3008,11 @@ class IPaddingLayer : public ILayer
virtual ~IPaddingLayer() noexcept = default;
};
+//!
+//! \struct Permutation
+//!
+//! \brief Represents a permutation of dimensions.
+//!
struct Permutation
{
//!
@@ -4059,7 +3085,7 @@ class IShuffleLayer : public ILayer
//!
//! If a second input had been used to create this layer, that input is reset to null by this method.
//!
- void setReshapeDimensions(Dims dimensions) noexcept
+ void setReshapeDimensions(Dims const& dimensions) noexcept
{
mImpl->setReshapeDimensions(dimensions);
}
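// A minimal usage sketch of an IShuffleLayer that applies a Permutation and then a reshape,
// assuming `network` and a 3D `input` already exist elsewhere.
inline nvinfer1::ITensor* transposeAndFlatten(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    nvinfer1::IShuffleLayer* shuffle = network.addShuffle(input);
    nvinfer1::Permutation perm{{0, 2, 1}}; // swap dimensions 1 and 2 of the 3D input
    shuffle->setFirstTranspose(perm);
    // 0 copies the corresponding input dimension, -1 infers the remaining extent.
    shuffle->setReshapeDimensions(nvinfer1::Dims2{0, -1});
    return shuffle->getOutput(0);
}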
@@ -4178,7 +3204,6 @@ class IShuffleLayer : public ILayer
enum class SampleMode : int32_t
{
kSTRICT_BOUNDS = 0, //!< Fail with error when the coordinates are out of bounds.
- kDEFAULT TRT_DEPRECATED_ENUM = kSTRICT_BOUNDS, //! \deprecated Use kSTRICT_BOUNDS.
kWRAP = 1, //!< Coordinates wrap around periodically.
kCLAMP = 2, //!< Out of bounds indices are clamped to bounds.
kFILL = 3, //!< Use fill input value when coordinates are out of bounds.
@@ -4187,9 +3212,6 @@ enum class SampleMode : int32_t
//!< pixel and throws error for zero pixels.
};
-//! \deprecated Deprecated in TensorRT 8.5. Superseded by SampleMode.
-using SliceMode = SampleMode;
-
//!
//! Maximum number of elements in SampleMode enum.
//!
@@ -4224,7 +3246,7 @@ constexpr inline int32_t EnumMax() noexcept
//! stride = {1, 2}
//! output = {{1, 5}}
//!
-//! When the sliceMode is kCLAMP or kREFLECT, for each input dimension, if its size is 0 then the corresponding output
+//! When the sampleMode is kCLAMP or kREFLECT, for each input dimension, if its size is 0 then the corresponding output
//! dimension must be 0 too.
//!
//! A slice layer can produce a shape tensor if the following conditions are met:
@@ -4236,7 +3258,7 @@ constexpr inline int32_t EnumMax() noexcept
//!
//! The following constraints must be satisfied to execute this layer on DLA:
//! * start, size, and stride are build time constants, either as static Dims or as constant input tensors.
-//! * sliceMode is kDEFAULT.
+//! * sampleMode is kSTRICT_BOUNDS.
//! * Strides are 1 for all dimensions.
//! * Slicing is not performed on the first dimension
//! * The input tensor has four dimensions
@@ -4255,7 +3277,7 @@ class ISliceLayer : public ILayer
//!
//! \see getStart
//!
- void setStart(Dims start) noexcept
+ void setStart(Dims const& start) noexcept
{
mImpl->setStart(start);
}
@@ -4284,7 +3306,7 @@ class ISliceLayer : public ILayer
//!
//! \see getSize
//!
- void setSize(Dims size) noexcept
+ void setSize(Dims const& size) noexcept
{
return mImpl->setSize(size);
}
@@ -4313,7 +3335,7 @@ class ISliceLayer : public ILayer
//!
//! \see getStride
//!
- void setStride(Dims stride) noexcept
+ void setStride(Dims const& stride) noexcept
{
mImpl->setStride(stride);
}
@@ -4338,7 +3360,7 @@ class ISliceLayer : public ILayer
//!
//! \see getMode()
//!
- void setMode(SliceMode mode) noexcept
+ void setMode(SampleMode mode) noexcept
{
mImpl->setMode(mode);
}
@@ -4348,7 +3370,7 @@ class ISliceLayer : public ILayer
//!
//! \see setMode()
//!
- SliceMode getMode() const noexcept
+ SampleMode getMode() const noexcept
{
return mImpl->getMode();
}
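// A minimal usage sketch of the slice example documented above (start = {1,0}, size = {1,2},
// stride = {1,2}), assuming `network` and a 2x3 `input` already exist; out-of-bounds handling
// is left at the default kSTRICT_BOUNDS.
inline nvinfer1::ITensor* sliceExample(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    nvinfer1::ISliceLayer* slice = network.addSlice(
        input, nvinfer1::Dims2{1, 0}, nvinfer1::Dims2{1, 2}, nvinfer1::Dims2{1, 2});
    slice->setMode(nvinfer1::SampleMode::kSTRICT_BOUNDS);
    return slice->getOutput(0);
}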
@@ -4387,10 +3409,10 @@ class ISliceLayer : public ILayer
//!
//! \brief Layer type for getting shape of a tensor.
//!
-//! This layer sets the output to a 1D tensor of type Int32 with the dimensions of the input tensor.
+//! This layer sets the output to a 1D tensor of type Int64 with the dimensions of the input tensor.
//!
//! For example, if the input is a four-dimensional tensor (of any type) with
-//! dimensions [2,3,5,7], the output tensor is a one-dimensional Int32 tensor
+//! dimensions [2,3,5,7], the output tensor is a one-dimensional Int64 tensor
//! of length 4 containing the sequence 2, 3, 5, 7.
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
@@ -4538,10 +3560,10 @@ enum class MatrixOperation : int32_t
//! Treat x as a matrix if it has two dimensions, or as a collection of
//! matrices if x has more than two dimensions, where the last two dimensions
//! are the matrix dimensions. x must have at least two dimensions.
- kNONE,
+ kNONE = 0,
//! Like kNONE, but transpose the matrix dimensions.
- kTRANSPOSE,
+ kTRANSPOSE = 1,
//! Treat x as a vector if it has one dimension, or as a collection of
//! vectors if x has more than one dimension. x must have at least one dimension.
@@ -4553,7 +3575,7 @@ enum class MatrixOperation : int32_t
//! The second input tensor with dimensions [M,K] used with MatrixOperation::kVECTOR is equivalent to a tensor
//! with dimensions [M, K, 1] with MatrixOperation::kNONE, i.e. is treated as M column vectors of length K,
//! or dimensions [M, 1, K] with MatrixOperation::kTRANSPOSE.
- kVECTOR
+ kVECTOR = 2,
};
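// A minimal usage sketch of MatrixOperation, assuming `network`, `a` and `b` already exist:
// computes C = A * B^T by transposing the second operand.
inline nvinfer1::ITensor* matMulTransposeB(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& a, nvinfer1::ITensor& b)
{
    nvinfer1::IMatrixMultiplyLayer* mm = network.addMatrixMultiply(
        a, nvinfer1::MatrixOperation::kNONE, b, nvinfer1::MatrixOperation::kTRANSPOSE);
    return mm->getOutput(0);
}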
//!
@@ -4597,8 +3619,10 @@ class IMatrixMultiplyLayer : public ILayer
public:
//!
//! \brief Set the operation for an input tensor.
+ //!
//! \param index Input tensor number (0 or 1).
//! \param op New operation.
+ //!
//! \see getOperation()
//!
void setOperation(int32_t index, MatrixOperation op) noexcept
@@ -4718,6 +3742,10 @@ class ICastLayer : public ILayer
//!
//! \brief Set cast layer output type.
//!
+ //! \param toType The DataType of the output tensor.
+ //!
+ //! Set the output type of the cast layer.
+ //!
void setToType(DataType toType) noexcept
{
mImpl->setToType(toType);
@@ -4726,6 +3754,9 @@ class ICastLayer : public ILayer
//!
//! \brief Return cast layer output type.
//!
+ //! \return toType parameter set during layer creation or by setToType().
+ //! The return value is the output type of the cast layer.
+ //!
DataType getToType() const noexcept
{
return mImpl->getToType();
@@ -4750,9 +3781,8 @@ class IConstantLayer : public ILayer
//!
//! \brief Set the weights for the layer.
//!
- //! If weights.type is DataType::kINT32, the output is a tensor of 32-bit indices.
- //! Otherwise the output is a tensor of real values and the output type will be
- //! follow TensorRT's normal precision rules.
+ //! The output type is weights.type. If the network is weakly typed and the weights have a real type,
+ //! the output type might be different per TensorRT's type conversion rules.
//!
//! \see getWeights()
//!
@@ -4778,7 +3808,7 @@ class IConstantLayer : public ILayer
//!
//! \see setDimensions
//!
- void setDimensions(Dims dimensions) noexcept
+ void setDimensions(Dims const& dimensions) noexcept
{
mImpl->setDimensions(dimensions);
}
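// A minimal usage sketch of a constant layer, assuming `network` already exists. The Weights
// memory must remain valid until the engine has been built.
inline nvinfer1::ITensor* addBiasConstant(nvinfer1::INetworkDefinition& network)
{
    static float const values[3] = {0.1F, 0.2F, 0.3F};
    nvinfer1::Weights weights{nvinfer1::DataType::kFLOAT, values, 3};
    nvinfer1::IConstantLayer* constant = network.addConstant(nvinfer1::Dims2{1, 3}, weights);
    return constant->getOutput(0);
}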
@@ -4828,9 +3858,6 @@ enum class InterpolationMode : int32_t
kCUBIC = 2 //!< Supports bicubic (2D) interpolation
};
-//! \deprecated Deprecated in TensorRT 8.5. Superseded by InterpolationMode.
-using ResizeMode = InterpolationMode;
-
namespace impl
{
//!
@@ -4972,13 +3999,13 @@ struct EnumMaxImpl
//! Resize layer can be used for resizing a N-D tensor.
//!
//! Resize layer currently supports the following configurations:
-//! - ResizeMode::kNEAREST - resizes innermost `m` dimensions of N-D, where 0 < m <= min(8, N) and N > 0
-//! - ResizeMode::kLINEAR - resizes innermost `m` dimensions of N-D, where 0 < m <= min(3, N) and N > 0
+//! - InterpolationMode::kNEAREST - resizes innermost `m` dimensions of N-D, where 0 < m <= min(8, N) and N > 0
+//! - InterpolationMode::kLINEAR - resizes innermost `m` dimensions of N-D, where 0 < m <= min(3, N) and N > 0
//!
-//! Default resize mode is ResizeMode::kNEAREST.
+//! Default resize mode is InterpolationMode::kNEAREST.
//!
//! The coordinates in the output tensor are mapped to coordinates in the input tensor using a function set by calling
-//! setCoordinateTransformation(). The default for all ResizeMode settings (nearest, linear, bilinear, etc.) is
+//! setCoordinateTransformation(). The default for all InterpolationMode settings (nearest, linear, bilinear, etc.) is
//! ResizeCoordinateTransformation::kASYMMETRIC.
//!
//! The resize layer provides two ways to resize tensor dimensions.
@@ -5022,7 +4049,7 @@ class IResizeLayer : public ILayer
//! \see setScales
//! \see getOutputDimensions
//!
- void setOutputDimensions(Dims dimensions) noexcept
+ void setOutputDimensions(Dims const& dimensions) noexcept
{
return mImpl->setOutputDimensions(dimensions);
}
@@ -5091,11 +4118,11 @@ class IResizeLayer : public ILayer
//!
//! Supported resize modes are Nearest Neighbor and Linear.
//!
- //! \see ResizeMode
+ //! \see InterpolationMode
//!
- void setResizeMode(ResizeMode resizeMode) noexcept
+ void setResizeMode(InterpolationMode interpolationMode) noexcept
{
- mImpl->setResizeMode(resizeMode);
+ mImpl->setResizeMode(interpolationMode);
}
//!
@@ -5103,39 +4130,11 @@ class IResizeLayer : public ILayer
//!
//! \return The resize mode.
//!
- ResizeMode getResizeMode() const noexcept
+ InterpolationMode getResizeMode() const noexcept
{
return mImpl->getResizeMode();
}
- //!
- //! \brief Set whether to align corners while resizing.
- //!
- //! If true, the centers of the 4 corner pixels of both input and output
- //! tensors are aligned i.e. preserves the values of corner
- //! pixels.
- //!
- //! Default: false.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by IResizeLayer::setCoordinateTransformation().
- //!
- TRT_DEPRECATED void setAlignCorners(bool alignCorners) noexcept
- {
- mImpl->setAlignCorners(alignCorners);
- }
-
- //!
- //! \brief True if align corners has been set.
- //!
- //! \return True if align corners has been set, false otherwise.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by IResizeLayer::getCoordinateTransformation().
- //!
- TRT_DEPRECATED bool getAlignCorners() const noexcept
- {
- return mImpl->getAlignCorners();
- }
-
//!
//! \brief Append or replace an input of this layer with a specific tensor
//!
@@ -5290,7 +4289,9 @@ class IResizeLayer : public ILayer
apiv::VResizeLayer* mImpl;
};
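// A minimal usage sketch of a linear resize to a fixed output shape, assuming `network` and an
// NCHW `input` already exist; the coordinate transformation is set explicitly rather than via
// the removed align-corners flag.
inline nvinfer1::ITensor* resizeTo224(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    nvinfer1::IResizeLayer* resize = network.addResize(input);
    resize->setOutputDimensions(nvinfer1::Dims4{1, 3, 224, 224});
    resize->setResizeMode(nvinfer1::InterpolationMode::kLINEAR);
    resize->setCoordinateTransformation(nvinfer1::ResizeCoordinateTransformation::kHALF_PIXEL);
    return resize->getOutput(0);
}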
-//! Enum that describes kinds of loop outputs.
+//!
+//! \enum LoopOutput
+//!
+//! \brief Enum that describes kinds of loop outputs.
+//!
enum class LoopOutput : int32_t
{
//! Output value is value of tensor for last iteration.
@@ -5314,11 +4315,13 @@ constexpr inline int32_t EnumMax() noexcept
return 3;
}
-//! Enum that describes kinds of trip limits.
+//!
+//! \enum TripLimit
+//!
+//! \brief Enum that describes kinds of trip limits.
+//!
enum class TripLimit : int32_t
{
- kCOUNT = 0, //!< Tensor is scalar of type kINT32 that contains the trip count.
+ kCOUNT = 0, //!< Tensor is a scalar of type kINT32 or kINT64 that contains the trip count.
kWHILE = 1 //!< Tensor is a scalar of type kBOOL. Loop terminates when value is false.
};
@@ -5335,10 +4338,17 @@ constexpr inline int32_t EnumMax() noexcept
class ILoop;
+//!
+//! \class ILoopBoundaryLayer
+//!
+//! \brief This is a base class for Loop boundary layers.
+//!
class ILoopBoundaryLayer : public ILayer
{
public:
- //! Return pointer to ILoop associated with this boundary layer.
+ //!
+ //! \brief Get a pointer to ILoop associated with this boundary layer.
+ //!
ILoop* getLoop() const noexcept
{
return mBoundary->getLoop();
@@ -5350,14 +4360,18 @@ class ILoopBoundaryLayer : public ILayer
};
//!
-//! This is a base class for Conditional boundary layers.
+//! \class IIfConditionalBoundaryLayer
+//!
+//! \brief This is a base class for Conditional boundary layers.
//!
//! Boundary layers are used to demarcate the boundaries of Conditionals.
//!
class IIfConditionalBoundaryLayer : public ILayer
{
public:
- //! Return pointer to the IIfConditional associated with this boundary layer.
+ //!
+ //! \brief Get a pointer to the IIfConditional associated with this boundary layer.
+ //!
IIfConditional* getConditional() const noexcept
{
return mBoundary->getConditional();
@@ -5369,7 +4383,9 @@ class IIfConditionalBoundaryLayer : public ILayer
};
//!
-//! This layer represents a condition input to an IIfConditional.
+//! \class IConditionLayer
+//!
+//! \brief This layer represents a condition input to an IIfConditional.
//!
class IConditionLayer : public IIfConditionalBoundaryLayer
{
@@ -5380,7 +4396,9 @@ class IConditionLayer : public IIfConditionalBoundaryLayer
};
//!
-//! This layer represents an output of an IIfConditional.
+//! \class IIfConditionalOutputLayer
+//!
+//! \brief This layer represents an output of an IIfConditional.
//!
//! An IIfConditionalOutputLayer has exactly one output.
//!
@@ -5393,7 +4411,9 @@ class IIfConditionalOutputLayer : public IIfConditionalBoundaryLayer
};
//!
-//! This layer represents an input to an IIfConditional.
+//! \class IIfConditionalInputLayer
+//!
+//! \brief This layer represents an input to an IIfConditional.
//!
class IIfConditionalInputLayer : public IIfConditionalBoundaryLayer
{
@@ -5404,7 +4424,9 @@ class IIfConditionalInputLayer : public IIfConditionalBoundaryLayer
};
//!
-//! Helper for constructing conditionally-executed subgraphs.
+//! \class IIfConditional
+//!
+//! \brief Helper for constructing conditionally-executed subgraphs.
//!
//! An If-conditional conditionally executes part of the network according
//! to the following pseudo-code:
@@ -5416,13 +4438,13 @@ class IIfConditionalInputLayer : public IIfConditionalBoundaryLayer
//! Emit output
//!
//! Condition is a 0D boolean tensor (representing a scalar).
-//! trueSubgraph represents a network subgraph that is executed when condition is evaluated to True.
-//! falseSubgraph represents a network subgraph that is executed when condition is evaluated to False.
+//! trueSubgraph represents a network subgraph that is executed when condition evaluates to True.
+//! falseSubgraph represents a network subgraph that is executed when condition evaluates to False.
//!
//! The following constraints apply to If-conditionals:
//! - Both the trueSubgraph and falseSubgraph must be defined.
//! - The number of output tensors in both subgraphs is the same.
-//! - The type and shape of each output tensor from true/false subgraphs are the same.
+//! - Corresponding output tensors from the true/false subgraphs have the same type and shape.
//!
class IIfConditional : public INoCopy
{
@@ -5499,7 +4521,11 @@ class IIfConditional : public INoCopy
apiv::VIfConditional* mImpl;
};
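// A minimal usage sketch of an If-conditional with a single output, assuming `network`, a 0D
// kBOOL `condition` tensor, and `x` already exist; the true branch here is a ReLU and the false
// branch is an identity-style shuffle, both drawn from the same conditional input.
inline nvinfer1::ITensor* addReluOrIdentity(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& condition, nvinfer1::ITensor& x)
{
    nvinfer1::IIfConditional* conditional = network.addIfConditional();
    conditional->setCondition(condition);
    nvinfer1::ITensor* xInside = conditional->addInput(x)->getOutput(0);

    nvinfer1::ITensor* trueOut =
        network.addActivation(*xInside, nvinfer1::ActivationType::kRELU)->getOutput(0);
    nvinfer1::ITensor* falseOut = network.addShuffle(*xInside)->getOutput(0);

    return conditional->addOutput(*trueOut, *falseOut)->getOutput(0);
}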
-
+//!
+//! \class IRecurrenceLayer
+//!
+//! \brief A recurrence layer in a network definition.
+//!
class IRecurrenceLayer : public ILoopBoundaryLayer
{
public:
@@ -5529,7 +4555,9 @@ class IRecurrenceLayer : public ILoopBoundaryLayer
};
//!
-//! An ILoopOutputLayer is the sole way to get output from a loop.
+//! \class ILoopOutputLayer
+//!
+//! \brief An ILoopOutputLayer is the sole way to get output from a loop.
//!
//! The first input tensor must be defined inside the loop; the output tensor is outside the loop.
//! The second input tensor, if present, must be defined outside the loop.
@@ -5548,6 +4576,9 @@ class IRecurrenceLayer : public ILoopBoundaryLayer
class ILoopOutputLayer : public ILoopBoundaryLayer
{
public:
+ //!
+    //! \brief Get the kind of loop output this layer produces.
+ //!
LoopOutput getLoopOutput() const noexcept
{
return mImpl->getLoopOutput();
@@ -5570,7 +4601,9 @@ class ILoopOutputLayer : public ILoopBoundaryLayer
mImpl->setAxis(axis);
}
- //! Get axis being concatenated over.
+ //!
+ //! \brief Get axis being concatenated over.
+ //!
int32_t getAxis() const noexcept
{
return mImpl->getAxis();
@@ -5591,7 +4624,7 @@ class ILoopOutputLayer : public ILoopBoundaryLayer
//! The indices in the kCONCATENATE or kREVERSE cases are as follows:
//!
//! - 0: Contribution to the output tensor. The contribution must come from inside the loop.
- //! - 1: The concatenation length scalar value, must come from outside the loop, as a 0D Int32 shape tensor.
+ //! - 1: The concatenation length scalar value, must come from outside the loop, as a 0D Int32 or Int64 shape tensor.
//!
//! If this function is called with the value 1, then the function getNbInputs() changes
//! from returning 1 to 2.
@@ -5603,9 +4636,17 @@ class ILoopOutputLayer : public ILoopBoundaryLayer
apiv::VLoopOutputLayer* mImpl;
};
+//!
+//! \class ITripLimitLayer
+//!
+//! \brief A layer that represents a trip-count limiter.
+//!
class ITripLimitLayer : public ILoopBoundaryLayer
{
public:
+ //!
+    //! \brief Get the kind of trip limit used by this layer.
+ //!
TripLimit getTripLimit() const noexcept
{
return mImpl->getTripLimit();
@@ -5616,32 +4657,49 @@ class ITripLimitLayer : public ILoopBoundaryLayer
apiv::VTripLimitLayer* mImpl;
};
+//!
+//! \class IIteratorLayer
+//!
+//! \brief A layer that iterates over a tensor inside a loop, producing one slice per iteration.
+//!
class IIteratorLayer : public ILoopBoundaryLayer
{
public:
- //! Set axis to iterate over.
+ //!
+ //! \brief Set axis to iterate over.
+ //!
void setAxis(int32_t axis) noexcept
{
mImpl->setAxis(axis);
}
- //! Get axis being iterated over.
+ //!
+ //! \brief Get axis being iterated over.
+ //!
int32_t getAxis() const noexcept
{
return mImpl->getAxis();
}
+ //!
+ //! \brief Set iteration order to be reverse.
+ //!
//! For reverse=false, the layer is equivalent to addGather(tensor, I, 0) where I is a
//! scalar tensor containing the loop iteration number.
//! For reverse=true, the layer is equivalent to addGather(tensor, M-1-I, 0) where M is the trip count
//! computed from TripLimits of kind kCOUNT.
//! The default is reverse=false.
+ //!
void setReverse(bool reverse) noexcept
{
mImpl->setReverse(reverse);
}
- //! True if and only if reversing input.
+ //!
+ //! \brief Check if the iteration order is reverse.
+ //!
+ //! \return True if and only if reversing input.
+ //!
bool getReverse() const noexcept
{
return mImpl->getReverse();
@@ -5653,9 +4711,9 @@ class IIteratorLayer : public ILoopBoundaryLayer
};
//!
-//! Helper for creating a recurrent subgraph.
+//! \class ILoop
//!
-//! An ILoop cannot be added to an INetworkDefinition where hasImplicitBatchDimensions() returns true.
+//! \brief Helper for creating a recurrent subgraph.
//!
class ILoop : public INoCopy
{
@@ -5705,6 +4763,7 @@ class ILoop : public INoCopy
return mImpl->addIterator(tensor, axis, reverse);
}
+ //!
//! \brief Make an output for this loop, based on the given tensor.
//!
//! axis is the axis for concatenation (if using outputKind of kCONCATENATE or kREVERSE).
@@ -5747,6 +4806,10 @@ class ILoop : public INoCopy
apiv::VLoop* mImpl;
};
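// A minimal usage sketch of a counted loop that sums the slices of `input` along axis 0,
// assuming `network` and `input` already exist; `tripCount` is a 0D kINT32 tensor holding the
// number of iterations and `zero` is a tensor shaped like one slice, used to seed the recurrence.
inline nvinfer1::ITensor* sumOverAxis0(nvinfer1::INetworkDefinition& network,
    nvinfer1::ITensor& input, nvinfer1::ITensor& tripCount, nvinfer1::ITensor& zero)
{
    nvinfer1::ILoop* loop = network.addLoop();
    loop->addTripLimit(tripCount, nvinfer1::TripLimit::kCOUNT);

    nvinfer1::ITensor* slice = loop->addIterator(input)->getOutput(0); // one slice of axis 0 per iteration
    nvinfer1::IRecurrenceLayer* acc = loop->addRecurrence(zero);
    nvinfer1::ITensor* sum = network
        .addElementWise(*acc->getOutput(0), *slice, nvinfer1::ElementWiseOperation::kSUM)
        ->getOutput(0);
    acc->setInput(1, *sum); // value carried into the next iteration

    // kLAST_VALUE emits the accumulated value from the final iteration.
    return loop->addLoopOutput(*sum, nvinfer1::LoopOutput::kLAST_VALUE)->getOutput(0);
}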
+//!
+//! \class ISelectLayer
+//!
+//! \brief A select layer in a network definition.
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
@@ -5757,6 +4820,7 @@ class ISelectLayer : public ILayer
apiv::VSelectLayer* mImpl;
};
+//!
//! \class IAssertionLayer
//!
//! \brief An assertion layer in a network
@@ -5812,9 +4876,28 @@ class IAssertionLayer : public ILayer
//!
enum class FillOperation : int32_t
{
- kLINSPACE = 0, //!< Generate evenly spaced numbers over a specified interval.
- kRANDOM_UNIFORM = 1, //!< Generate a tensor with random values drawn from a uniform distribution.
- kRANDOM_NORMAL = 2 //!< Generate a tensor with random values drawn from a normal distribution.
+ //! Compute each value via an affine function of its indices.
+ //! For example, suppose the parameters for the IFillLayer are:
+ //!
+ //! * Dimensions = [3,4]
+ //! * Alpha = 1
+ //! * Beta = [100,10]
+ //!
+ //! Element [i,j] of the output is Alpha + Beta[0]*i + Beta[1]*j.
+ //! Thus the output matrix is:
+ //!
+ //! 1 11 21 31
+ //! 101 111 121 131
+ //! 201 211 221 231
+ //!
+ //! A static beta b is implicitly a 1D tensor, i.e. Beta = [b].
+ kLINSPACE = 0,
+
+ //! Randomly draw values from a uniform distribution.
+ kRANDOM_UNIFORM = 1,
+
+ //! Randomly draw values from a normal distribution.
+ kRANDOM_NORMAL = 2
};
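// A minimal usage sketch of a static kLINSPACE fill producing 0, 1, ..., 9 as a 1D tensor,
// assuming `network` already exists and using the two-argument addFill overload; a dynamic
// shape would instead be supplied via setInput(0, ...).
inline nvinfer1::ITensor* addIota10(nvinfer1::INetworkDefinition& network)
{
    nvinfer1::Dims dims{};
    dims.nbDims = 1;
    dims.d[0] = 10;
    nvinfer1::IFillLayer* fill = network.addFill(dims, nvinfer1::FillOperation::kLINSPACE);
    fill->setAlpha(0.0); // start value
    fill->setBeta(1.0);  // delta; a static beta b is implicitly Beta = [b]
    return fill->getOutput(0);
}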
//!
@@ -5829,30 +4912,40 @@ constexpr inline int32_t EnumMax() noexcept
}
//!
-//! \brief Generate an output tensor with specified mode.
+//! \class IFillLayer
+//!
+//! \brief Generate a tensor according to a specified mode.
+//!
+//! The fill layer generates a tensor with values that are drawn from a random distribution
+//! or an affine function of their indices, as specified by the FillOperation.
//!
-//! The fill layer has two variants, static and dynamic. Static fill specifies its parameters
-//! at layer creation time via Dims and the get/set accessor functions of the IFillLayer.
-//! Dynamic fill specifies one or more of its parameters as ITensors, by using ILayer::setInput to add
-//! a corresponding input. The corresponding static parameter is used if an input is missing or null.
+//! When an IFillLayer is initially added to a network, all of its parameters are static.
+//! Each parameter may be changed to dynamic by setting a corresponding input.
+//! A parameter is considered dynamic even if that input is the output of an IConstantLayer.
+//! The inputs for each parameter are:
//!
-//! The shape of the output is specified by the parameter \p Dimension, or if non-null and present,
-//! the first input, which must be a 1D Int32 shape tensor. Thus an application can determine if the
-//! IFillLayer has a dynamic output shape based on whether it has a non-null first input.
+//! - 0: Dimensions
+//! - 1: Alpha
+//! - 2: Beta
//!
-//! Alpha and Beta are treated differently based on the Fill Operation specified. See details in
-//! IFillLayer::setAlpha(), IFillLayer::setBeta(), and IFillLayer::setInput().
+//! The parameter Dimensions describes the shape of the output. If the Dimensions input is provided,
+//! it must be a 1D tensor of type Int32 or Int64 whose length is computable by constant folding.
//!
-//! A fill layer can produce a shape tensor if the following restrictions are met:
+//! The meanings of Alpha and Beta depend on the mode, as described in IFillLayer::setAlpha(),
+//! IFillLayer::setBeta(), and IFillLayer::setInput(). Parameters Alpha and Beta must both be static
+//! or both be dynamic.
+//!
+//! An IFillLayer can produce a shape tensor if the following restrictions are met:
//!
//! * The FillOperation is kLINSPACE.
-//! * The output is an Int32 or Float tensor within the volume limit of a shape tensor.
-//! * There is at most one input, and if so, that input is input 0.
-//! * If input 0 exists, the length of the output tensor must be computable by constant folding.
+//! * The output has type Int32, Int64, or Float.
+//! * The volume of the output is within the volume limit imposed on shape tensors.
+//! * If input 0 exists, the values of input 0 must be computable by constant folding.
//!
//! \see FillOperation
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+//!
class IFillLayer : public ILayer
{
public:
@@ -5865,7 +4958,7 @@ class IFillLayer : public ILayer
//!
//! \see getDimensions
//
- void setDimensions(Dims dimensions) noexcept
+ void setDimensions(Dims const& dimensions) noexcept
{
mImpl->setDimensions(dimensions);
}
@@ -5915,9 +5008,9 @@ class IFillLayer : public ILayer
//! kRANDOM_UNIFORM | the minimum value, defaults to 0.0;
//! kRANDOM_NORMAL | the mean of the normal distribution, default is 0.0;
//!
- //! If a second input had been used to create this layer, that input is reset to null by this method.
+ //! If input 1 exists, it is reset to null by this method.
//!
- //! \see getAlpha
+ //! \see getAlpha, setAlphaInt64
//
void setAlpha(double alpha) noexcept
{
@@ -5949,7 +5042,7 @@ class IFillLayer : public ILayer
//! kRANDOM_UNIFORM | the maximal value, defaults to 1.0;
//! kRANDOM_NORMAL | the standard deviation of the normal distribution, default is 1.0;
//!
- //! If a third input had been used to create this layer, that input is reset to null by this method.
+ //! If input 2 exists, it is reset to null by this method.
//!
//! \see getBeta
//!
@@ -5966,7 +5059,7 @@ class IFillLayer : public ILayer
//! If the third input is present and non-null,
//! this function returns -1.0.
//!
- //! \see setBeta
+ //! \see setBeta, setBetaInt64
//!
double getBeta() const noexcept
{
@@ -5974,32 +5067,40 @@ class IFillLayer : public ILayer
}
//!
- //! \brief replace an input of this layer with a specific tensor.
+ //! \brief Replace an input of this layer with a specific tensor.
//!
//! \param index the index of the input to set.
//! \param tensor the new input tensor
//!
- //! Indices for kLINSPACE are described as:
+ //! The three inputs correspond to these setters of IFillLayer:
+ //!
+ //! - 0: setDimensions
+ //! - 1: setAlpha
+ //! - 2: setBeta
+ //!
+ //! The following descriptions give more intuitive names for the inputs.
+ //!
+ //! Indices for kLINSPACE are:
//!
- //! - 0: Shape tensor, represents the output tensor's dimensions.
- //! - 1: Start, a scalar, represents the start value.
- //! - 2: Delta, a 1D tensor, length equals to shape tensor's nbDims, represents the delta value for each dimension.
+ //! - 0: Shape, a 1D shape tensor, specifies the output tensor's dimensions.
+ //! - 1: Start, a scalar, specifies the start value.
+ //! - 2: Delta, a 1D tensor, specifies the delta value for each dimension.
//!
- //! Indices for kRANDOM_UNIFORM are described as:
+ //! Indices for kRANDOM_UNIFORM are:
//!
- //! - 0: Shape tensor, represents the output tensor's dimensions.
- //! - 1: Minimum, a scalar, represents the minimum random value.
- //! - 2: Maximum, a scalar, represents the maximal random value.
+ //! - 0: Shape, a 1D shape tensor, specifies the output tensor's dimensions.
+ //! - 1: Minimum, a scalar, specifies the minimum random value.
+ //! - 2: Maximum, a scalar, specifies the maximal random value.
//!
- //! Indices for kRANDOM_NORMAL are described as:
+ //! Indices for kRANDOM_NORMAL are:
//!
- //! - 0: Shape tensor, represents the output tensor's dimensions.
- //! - 1: Mean, a scalar, represents the mean of the normal distribution,.
- //! - 2: Scale, a scalar, represents the standard deviation of the normal distribution.
+ //! - 0: Shape, a 1D shape tensor, specifies the output tensor's dimensions.
+    //! - 1: Mean, a scalar, specifies the mean of the normal distribution.
+ //! - 2: Scale, a scalar, specifies the standard deviation of the normal distribution.
//!
//! Using the corresponding setter resets the input to null.
//!
- //! If either inputs 1 or 2, is non-null, then both must be non-null and have the same data type.
+    //! If either input 1 or 2 is non-null, then both must be non-null and have the same data type.
//!
//! If this function is called for an index greater or equal to getNbInputs(),
//! then afterwards getNbInputs() returns index + 1, and any missing intervening
@@ -6007,6 +5108,111 @@ class IFillLayer : public ILayer
//!
using ILayer::setInput;
+ //!
+ //! \brief Set the alpha parameter with int64 datatype.
+ //!
+ //! \param alpha has different meanings for each operator:
+ //!
+ //! Operation | Usage
+ //! kLINSPACE | the start value, defaults to 0;
+ //! kRANDOM_UNIFORM | the minimum value, defaults to 0;
+ //! kRANDOM_NORMAL | the mean of the normal distribution, default is 0;
+ //!
+    //! If input 1 exists, it is reset to null by this method.
+ //!
+ //! \see getAlphaInt64
+ //
+ void setAlphaInt64(int64_t alpha) noexcept
+ {
+ mImpl->setAlphaInt64(alpha);
+ }
+
+ //!
+ //! \brief Get the value of alpha parameter with int64 datatype.
+ //!
+    //! \return An int64 value of alpha.
+ //!
+ //! If the second input is present and non-null,
+ //! this function returns -1.
+ //!
+ //! \see setAlphaInt64
+ //!
+ int64_t getAlphaInt64() const noexcept
+ {
+ return mImpl->getAlphaInt64();
+ }
+
+ //!
+ //! \brief Set the beta parameter with int64 datatype.
+ //!
+ //! \param beta has different meanings for each operator:
+ //!
+ //! Operation | Usage
+ //! kLINSPACE | the delta value, defaults to 1;
+ //! kRANDOM_UNIFORM | the maximal value, defaults to 1;
+ //! kRANDOM_NORMAL | the standard deviation of the normal distribution, default is 1;
+ //!
+    //! If input 2 exists, it is reset to null by this method.
+ //!
+ //! \see getBetaInt64
+ //!
+ void setBetaInt64(int64_t beta) noexcept
+ {
+ mImpl->setBetaInt64(beta);
+ }
+
+ //!
+ //! \brief Get the value of beta parameter with int64 datatype.
+ //!
+    //! \return An int64 value of beta.
+ //!
+ //! If the third input is present and non-null,
+    //! this function returns -1.
+ //!
+ //! \see setBetaInt64
+ //!
+ int64_t getBetaInt64() const noexcept
+ {
+ return mImpl->getBetaInt64();
+ }
+
+ //!
+ //! \brief Return true if alpha/beta have type int64, false if they have type double.
+ //!
+ bool isAlphaBetaInt64() const noexcept
+ {
+ return mImpl->isAlphaBetaInt64();
+ }
+
+ //!
+ //! \brief Set the fill layer output type.
+ //!
+ //! \param toType The DataType of the output tensor.
+ //!
+ //! Set the output type of the fill layer. Valid values are DataType::kFLOAT, DataType::kINT32,
+ //! and DataType::kINT64.
+ //! If the network is strongly typed, setToType must be used to set the output type, and use of setOutputType
+ //! is an error. Otherwise, types passed to setOutputType and setToType must be the same.
+ //!
+ //! \see NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
+ //!
+ void setToType(DataType toType) noexcept
+ {
+ mImpl->setToType(toType);
+ }
+
+ //!
+ //! \brief Get the fill layer output type.
+ //!
+ //! \return toType parameter set during layer creation or by setToType().
+ //! The return value is the output type of the fill layer.
+ //! The default value is DataType::kFLOAT.
+ //!
+ DataType getToType() const noexcept
+ {
+ return mImpl->getToType();
+ }
+
protected:
virtual ~IFillLayer() noexcept = default;
apiv::VFillLayer* mImpl;
@@ -6018,32 +5224,39 @@ class IFillLayer : public ILayer
//! \brief A Quantize layer in a network definition.
//!
//! This layer accepts a floating-point data input tensor, and uses the scale and zeroPt inputs to
-//! quantize the data to an 8-bit signed integer according to:
+//! quantize the data according to:
//! \p output = clamp(round(\p input / \p scale) + \p zeroPt)
//!
//! Rounding type is rounding-to-nearest ties-to-even (https://en.wikipedia.org/wiki/Rounding#Round_half_to_even).
-//! Clamping is in the range [-128, 127].
+//! Clamping range according to data type:
+//! - FP8: [-448, 448]
+//! - INT4: [-8, 7]
+//! - INT8: [-128, 127]
//!
//! The first input (index 0) is the tensor to be quantized.
//! The second (index 1) and third (index 2) are the scale and zero point respectively.
-//! Each of \p scale and \p zeroPt must be either a scalar, or a 1D tensor.
+//! \p scale and \p zeroPt should have identical dimensions, with a rank no greater than 2.
//!
-//! The \p zeroPt tensor is optional, and if not set, will be assumed to be zero. Its data type must be
-//! DataType::kINT8. \p zeroPt must only contain zero-valued coefficients, because only symmetric quantization is
+//! The \p zeroPt tensor is optional, and if not set, will be assumed to be zero. Its data type must match the
+//! output data type. \p zeroPt must only contain zero-valued coefficients, because only symmetric quantization is
//! supported.
-//! The \p scale value must be either a scalar for per-tensor quantization, or a 1D tensor for per-channel
-//! quantization. All \p scale coefficients must have positive values. The size of the 1-D \p scale tensor must match
-//! the size of the quantization axis. The size of the \p scale must match the size of the \p zeroPt.
+//! The \p scale value must be a scalar for per-tensor quantization, a 1-D tensor for per-channel quantization, or a
+//! 2-D tensor for block quantization (supported for DataType::kINT4 only). All \p scale coefficients must have
+//! positive values. The size of the 1-D \p scale tensor must match the size of the quantization axis. For block
+//! quantization, the shape of \p scale tensor must match the shape of the input, except for one dimension in which
+//! blocking occurs. The size of \p zeroPt must match the size of \p scale.
//!
-//! The subgraph which terminates with the \p scale tensor must be a build-time constant. The same restrictions apply
+//! The subgraph which terminates with the \p scale tensor must be a build-time constant. The same restrictions apply
//! to the \p zeroPt.
-//! The output type, if constrained, must be constrained to DataType::kINT8. The input type, if constrained, must be
-//! constrained to DataType::kFLOAT or DataType::kHALF.
-//! The output size is the same as the input size. The quantization axis is in reference to the input tensor's
-//! dimensions.
+//! The output type, if constrained, must be constrained to DataType::kINT8, DataType::kFP8 or DataType::kINT4. The
+//! input type, if constrained, must be constrained to DataType::kFLOAT, DataType::kHALF, or DataType::kBF16. The
+//! output size is the same as the input size. The quantization axis is in reference to the input tensor's dimensions.
+//!
+//! IQuantizeLayer supports DataType::kFLOAT, DataType::kHALF, or DataType::kBF16 precision and will default to
+//! DataType::kFLOAT precision during instantiation. For strongly typed networks, \p input data type must match the
+//! \p scale data type.
//!
-//! IQuantizeLayer only supports DataType::kFLOAT precision and will default to this precision during instantiation.
-//! IQuantizeLayer only supports DataType::kINT8 output.
+//! IQuantizeLayer supports DataType::kINT8, DataType::kFP8, or DataType::kINT4 output.
//!
//! As an example of the operation of this layer, imagine a 4D NCHW activation input which can be quantized using a
//! single scale coefficient (referred to as per-tensor quantization):
@@ -6062,11 +5275,20 @@ class IFillLayer : public ILayer
//! For each s in S:
//! output[k,c,r,s] = clamp(round(\p input[k,c,r,s] / \p scale[k]) + \p zeroPt[k])
//!
+//! Block quantization is supported only for 2-D weight inputs of DataType::kINT4. As an example of blocked
+//! operation, imagine a 2-D RS weights input with R (dimension 0) as the blocking axis and B as the block size.
+//! The scale is a 2D array of coefficients, with dimensions (R//B, S).
+//! For each r in R:
+//! For each s in S:
+//! output[r,s] = clamp(round(\p input[r,s] / \p scale[r//B, s]) + \p zeroPt[r//B, s])
+//!
//! \note Only symmetric quantization is supported.
//! \note Currently the only allowed build-time constant \p scale and \p zeroPt subgraphs are:
//! 1. Constant -> Quantize
//! 2. Constant -> Cast -> Quantize
//!
+//! \note The input tensor for this layer must not be a scalar.
+//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
class IQuantizeLayer : public ILayer
@@ -6096,6 +5318,34 @@ class IQuantizeLayer : public ILayer
mImpl->setAxis(axis);
}
+ //!
+ //! \brief Set the Quantize layer output type.
+ //!
+ //! \param toType The DataType of the output tensor.
+ //!
+ //! Set the output type of the quantize layer. Valid values are DataType::kINT8 and DataType::kFP8.
+ //! If the network is strongly typed, setToType must be used to set the output type, and use of setOutputType
+ //! is an error. Otherwise, types passed to setOutputType and setToType must be the same.
+ //!
+ //! \see NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
+ //!
+ void setToType(DataType toType) noexcept
+ {
+ mImpl->setToType(toType);
+ }
+
+ //!
+ //! \brief Return the Quantize layer output type.
+ //!
+ //! \return toType parameter set during layer creation or by setToType().
+ //! The return value is the output type of the quantize layer.
+ //! The default value is DataType::kINT8.
+ //!
+ DataType getToType() const noexcept
+ {
+ return mImpl->getToType();
+ }
+
protected:
virtual ~IQuantizeLayer() noexcept = default;
apiv::VQuantizeLayer* mImpl;
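// A minimal usage sketch of per-channel INT8 quantization over the channel axis of an NCHW
// tensor, assuming `network`, `input`, and `scale` (a 1D kFLOAT tensor rooted at an
// IConstantLayer, one coefficient per channel) already exist, and assuming the two-argument
// addQuantize overload.
inline nvinfer1::ITensor* quantizePerChannel(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input, nvinfer1::ITensor& scale)
{
    nvinfer1::IQuantizeLayer* quantize = network.addQuantize(input, scale);
    quantize->setAxis(1); // axis 1 is the channel dimension of an NCHW tensor
    return quantize->getOutput(0);
}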
@@ -6106,29 +5356,35 @@ class IQuantizeLayer : public ILayer
//!
//! \brief A Dequantize layer in a network definition.
//!
-//! This layer accepts a signed 8-bit integer input tensor, and uses the configured scale and zeroPt inputs to
+//! This layer accepts a quantized type input tensor, and uses the configured scale and zeroPt inputs to
//! dequantize the input according to:
//! \p output = (\p input - \p zeroPt) * \p scale
//!
 //! The first input (index 0) is the tensor to be dequantized.
//! The second (index 1) and third (index 2) are the scale and zero point respectively.
-//! Each of \p scale and \p zeroPt must be either a scalar, or a 1D tensor.
+//! \p scale and \p zeroPt should have identical dimensions, with a rank no greater than 2.
//!
-//! The \p zeroPt tensor is optional, and if not set, will be assumed to be zero. Its data type must be
-//! DataType::kINT8. \p zeroPt must only contain zero-valued coefficients, because only symmetric quantization is
+//! The \p zeroPt tensor is optional, and if not set, will be assumed to be zero. Its data type must be identical to
+//! the input's data type. \p zeroPt must only contain zero-valued coefficients, because only symmetric quantization is
//! supported.
-//! The \p scale value must be either a scalar for per-tensor quantization, or a 1D tensor for per-channel
-//! quantization. All \p scale coefficients must have positive values. The size of the 1-D \p scale tensor must match
-//! the size of the quantization axis. The size of the \p scale must match the size of the \p zeroPt.
+//! The \p scale value must be either a scalar for per-tensor quantization, a 1-D tensor for per-channel quantization,
+//! or a 2-D tensor for block quantization (supported for DataType::kINT4 only). All \p scale coefficients must have
+//! positive values. The size of the 1-D \p scale tensor must match the size of the quantization axis. For block
+//! quantization, the shape of \p scale tensor must match the shape of the input, except for one dimension in which
+//! blocking occurs. The size of \p zeroPt must match the size of \p scale.
//!
//! The subgraph which terminates with the \p scale tensor must be a build-time constant. The same restrictions apply
//! to the \p zeroPt.
-//! The output type, if constrained, must be constrained to DataType::kFLOAT or DataType::kHALF. The input type, if
-//! constrained, must be constrained to DataType::kINT8. The output size is the same as the input size. The quantization
-//! axis is in reference to the input tensor's dimensions.
+//! The output type, if constrained, must be constrained to DataType::kFLOAT, DataType::kHALF, or DataType::kBF16. The
+//! input type, if constrained, must be constrained to DataType::kINT8, DataType::kFP8 or DataType::kINT4. The output
+//! size is the same as the input size. The quantization axis is in reference to the input tensor's dimensions.
//!
-//! IDequantizeLayer only supports DataType::kINT8 precision and will default to this precision during instantiation.
-//! IDequantizeLayer only supports DataType::kFLOAT or DataType::kHALF output.
+//! IDequantizeLayer supports DataType::kINT8, DataType::kFP8 or DataType::kINT4 precision and will default to
+//! DataType::kINT8 precision during instantiation. For strongly typed networks, \p input data type must be same as
+//! \p zeroPt data type.
+//!
+//! IDequantizeLayer supports DataType::kFLOAT, DataType::kHALF, or DataType::kBF16 output. For strongly typed
+//! networks, \p output data type is inferred from \p scale data type.
//!
//! As an example of the operation of this layer, imagine a 4D NCHW activation input which can be quantized using a
//! single scale coefficient (referred to as per-tensor quantization):
@@ -6148,11 +5404,21 @@ class IQuantizeLayer : public ILayer
//! For each s in S:
//! output[k,c,r,s] = (\p input[k,c,r,s] - \p zeroPt[k]) * \p scale[k]
//!
+//! Block dequantization is supported only for 2-D input tensors with DataType::kINT4 that are rooted at an
+//! IConstantLayer (i.e. weights). As an example of blocked operation, imagine a 2-D RS weights input with R
+//! (dimension 0) as the blocking axis and B as the block size. The scale is a 2-D array of coefficients, with
+//! dimensions (R//B, S).
+//! For each r in R:
+//! For each s in S:
+//! output[r,s] = (\p input[r,s] - \p zeroPt[r//B, s]) * \p scale[r//B, s]
+//!
//! \note Only symmetric quantization is supported.
//! \note Currently the only allowed build-time constant \p scale and \p zeroPt subgraphs are:
//! 1. Constant -> Quantize
//! 2. Constant -> Cast -> Quantize
//!
+//! \note The input tensor for this layer must not be a scalar.
+//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
class IDequantizeLayer : public ILayer
@@ -6179,7 +5445,35 @@ class IDequantizeLayer : public ILayer
//!
void setAxis(int32_t axis) noexcept
{
- mImpl->setAxis(axis);
+ mImpl->setAxis(axis);
+ }
+
+ //!
+ //! \brief Set the Dequantize layer output type.
+ //!
+ //! \param toType The DataType of the output tensor.
+ //!
+ //! Set the output type of the dequantize layer. Valid values are DataType::kFLOAT and DataType::kHALF.
+ //! If the network is strongly typed, setToType must be used to set the output type, and use of setOutputType
+ //! is an error. Otherwise, types passed to setOutputType and setToType must be the same.
+ //!
+ //! \see NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
+ //!
+ void setToType(DataType toType) noexcept
+ {
+ mImpl->setToType(toType);
+ }
+
+ //!
+ //! \brief Return the Dequantize layer output type.
+ //!
+ //! \return toType parameter set during layer creation or by setToType().
+    //! The return value is the output type of the dequantize layer.
+ //! The default value is DataType::kFLOAT.
+ //!
+ DataType getToType() const noexcept
+ {
+ return mImpl->getToType();
}
protected:
@@ -6187,6 +5481,7 @@ class IDequantizeLayer : public ILayer
apiv::VDequantizeLayer* mImpl;
};
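// A minimal usage sketch of the Constant -> Dequantize pattern used for quantized weights with a
// per-tensor scale, assuming `network`, `quantizedWeights` (an INT8 constant), and `scale` (a
// scalar kFLOAT constant) already exist, and assuming the two-argument addDequantize overload.
inline nvinfer1::ITensor* dequantizeWeights(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& quantizedWeights, nvinfer1::ITensor& scale)
{
    nvinfer1::IDequantizeLayer* dequantize = network.addDequantize(quantizedWeights, scale);
    return dequantize->getOutput(0);
}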
+//!
//! \class IEinsumLayer
//!
//! \brief An Einsum layer in a network
@@ -6203,9 +5498,9 @@ class IDequantizeLayer : public ILayer
//! means that those axes will be multiplied. Omitting a label from the output means values along those axes will be
//! summed. In implicit mode, the indices which appear once in the expression will be part of the output in increasing
//! alphabetical order. In explicit mode, the output can be controlled by specifying output subscript labels by adding
-//! an arrow (‘->’) followed by subscripts for the output.
-//! For example, “ij,jk->ik” is equivalent to “ij,jk”.
-//! Ellipsis (‘...’) can be used in place of subscripts to broadcast the dimensions.
+//! an arrow ('->') followed by subscripts for the output.
+//! For example, "ij,jk->ik" is equivalent to "ij,jk".
+//! Ellipsis ('...') can be used in place of subscripts to broadcast the dimensions.
//! See the TensorRT Developer Guide for more details on equation syntax.
//!
//! Many common operations can be expressed using the Einsum equation.
@@ -6254,6 +5549,8 @@ class IEinsumLayer : public ILayer
apiv::VEinsumLayer* mImpl;
};
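// A minimal usage sketch of an explicit Einsum equation expressing batched matrix multiplication,
// assuming `network`, `a`, and `b` already exist elsewhere.
inline nvinfer1::ITensor* einsumBatchedMatMul(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& a, nvinfer1::ITensor& b)
{
    nvinfer1::ITensor* inputs[] = {&a, &b};
    nvinfer1::IEinsumLayer* einsum = network.addEinsum(inputs, 2, "bij,bjk->bik");
    return einsum->getOutput(0);
}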
+//!
+//! \enum ScatterMode
//!
//! \brief Control form of IScatterLayer
//!
@@ -6295,7 +5592,7 @@ constexpr inline int32_t EnumMax() noexcept
 //! ScatterMode::kELEMENT: s = q = r
//! * Output is a tensor with the same dimensions as Data that stores the resulting values of the
//! transformation. It must not be a shape tensor.
-//! The types of Data, Update, and Output shall be the same, and Indices shall be DataType::kINT32.
+//! The types of Data, Update, and Output shall be the same, and Indices shall be DataType::kINT32 or DataType::kINT64.
//!
//! The output is computed by copying the data, and then updating elements of it based on indices.
//! How Indices are interpreted depends upon the ScatterMode.
@@ -6326,7 +5623,7 @@ constexpr inline int32_t EnumMax() noexcept
//! for c in [0,n)
//! for h in [0,n)
//! for w in [0,n)
-//! output[n,c,indices[n,c,h,w],w] = updates[n,c,h,w]]
+//! output[n,c,indices[n,c,h,w],w] = updates[n,c,h,w]
//!
//! Writes to the same output element cause undefined behavior.
//!
@@ -6391,8 +5688,7 @@ class IScatterLayer : public ILayer
//! The depth tensor must be a build-time constant, and its value should be positive.
//! * Output is a tensor with rank = rank(indices)+1, where the added dimension contains the one-hot encoding.
//! The data types of Output is equal to the Values data type.
-//! * Axis is a scaler specifying to which dimension of the output one-hot encoding is added.
-//! Axis defaults to -1, that is the new dimension in the output is its final dimension.
+//! * Axis is a scalar specifying to which dimension of the output one-hot encoding is added.
//! Valid range for axis is -rank(indices)-1 <= axis <= rank(indices).
//!
//! The output is computed by copying off_values to all output elements, then setting on_value on the indices
@@ -6430,6 +5726,7 @@ class IOneHotLayer : public ILayer
apiv::VOneHotLayer* mImpl;
};
+//!
//! \class IGridSampleLayer
//!
//! \brief A GridSample layer in a network definition.
@@ -6516,6 +5813,8 @@ class IGridSampleLayer : public ILayer
virtual ~IGridSampleLayer() noexcept = default;
}; // class IGridSampleLayer
+//!
+//! \enum BoundingBoxFormat
//!
//! \brief Representation of bounding box data used for the Boxes input tensor in INMSLayer
//!
@@ -6550,7 +5849,10 @@ constexpr inline int32_t EnumMax() noexcept
//! intersection-over-union (IoU) with previously selected boxes is less than or equal to a given threshold.
//! This layer implements NMS per batch item and per class.
//!
-//! For each batch item, the ordering of candidate bounding boxes with the same score is unspecified.
+//! Per batch item, boxes are initially sorted by their scores without regard to class. Only the highest-scoring boxes, up to the TopK limit per batch item, are considered for selection.
+//! During selection, only overlapping boxes of the same class are compared, so that overlapping boxes of different classes do not suppress each other.
+//!
+//! For each batch item, the ordering of candidate bounding boxes with the same score is unspecified, but the ordering will be consistent across different runs for the same inputs.
//!
//! The layer has the following inputs, in order of input index:
//!
@@ -6661,6 +5963,7 @@ class INMSLayer : public ILayer
virtual ~INMSLayer() noexcept = default;
}; // class INMSLayer
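// A minimal usage sketch of non-maximum suppression over boxes given in corner-pair format,
// assuming `network`, `boxes`, `scores`, and `maxOutputBoxesPerClass` (a 0D kINT32 tensor)
// already exist elsewhere.
inline nvinfer1::INMSLayer* addCornerPairNMS(nvinfer1::INetworkDefinition& network,
    nvinfer1::ITensor& boxes, nvinfer1::ITensor& scores, nvinfer1::ITensor& maxOutputBoxesPerClass)
{
    nvinfer1::INMSLayer* nms = network.addNMS(boxes, scores, maxOutputBoxesPerClass);
    nms->setBoundingBoxFormat(nvinfer1::BoundingBoxFormat::kCORNER_PAIRS);
    return nms;
}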
+//!
//! \class IReverseSequenceLayer
//!
//! \brief A ReverseSequence layer in a network definition.
@@ -6672,7 +5975,7 @@ class INMSLayer : public ILayer
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
-class IReverseSequenceLayer: public ILayer
+class IReverseSequenceLayer : public ILayer
{
public:
//!
@@ -6726,6 +6029,7 @@ class IReverseSequenceLayer: public ILayer
virtual ~IReverseSequenceLayer() noexcept = default;
}; // class IReverseSequenceLayer
+//!
//! \class INormalizationLayer
//!
//! \brief A normalization layer in a network definition.
@@ -6742,10 +6046,11 @@ class IReverseSequenceLayer: public ILayer
//! Where Mean(X, axes) is a reduction over a set of axes, and Variance(X) = Mean((X - Mean(X, axes)) ^ 2, axes).
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-
+//!
class INormalizationLayer : public ILayer
{
public:
+ //!
//! \brief Set the epsilon value used for the normalization calculation.
//!
//! The default value of \p eps is 1e-5F.
@@ -6757,6 +6062,7 @@ class INormalizationLayer : public ILayer
return mImpl->setEpsilon(eps);
}
+ //!
//! \brief Get the epsilon value used for the normalization calculation.
//!
//! \return The epsilon value used for the normalization calculation.
@@ -6766,6 +6072,7 @@ class INormalizationLayer : public ILayer
return mImpl->getEpsilon();
}
+ //!
//! \brief Set the reduction axes for the normalization calculation.
//!
//! \param axesMask The axes used for the normalization calculation.
@@ -6775,6 +6082,7 @@ class INormalizationLayer : public ILayer
return mImpl->setAxes(axesMask);
}
+ //!
//! \brief Get the axes value used for the normalization calculation.
//!
//! \return The axes used for the normalization calculation.
@@ -6784,6 +6092,7 @@ class INormalizationLayer : public ILayer
return mImpl->getAxes();
}
+ //!
//! \brief Set the number of groups used to split the channels in the normalization calculation.
//!
//! The input tensor channels are divided into \p nbGroups groups, and normalization is performed per group.
@@ -6799,30 +6108,38 @@ class INormalizationLayer : public ILayer
//!
//! \param nbGroups The number of groups to split the channels into for the normalization calculation.
//!
- void setNbGroups(int32_t nbGroups) noexcept
+ void setNbGroups(int64_t nbGroups) noexcept
{
return mImpl->setNbGroups(nbGroups);
}
+ //!
//! \brief Get the number of groups used to split the channels for the normalization calculation.
//!
//! \return The number of groups used to split the channel used for the normalization calculation.
//!
- int32_t getNbGroups() const noexcept
+ int64_t getNbGroups() const noexcept
{
return mImpl->getNbGroups();
}
+ //!
//! \brief Set the compute precision of this layer.
//!
//! \param type The datatype used for the compute precision of this layer.
//!
- //! By default TensorRT will run the normalization computation in DataType::kFLOAT32 even in mixed precision
- //! mode regardless of any set builder flags to avoid overflow errors. To override this default,
- //! use this function to set the desired compute precision.
+    //! By default, to avoid overflow errors, TensorRT will run the normalization computation in DataType::kFLOAT
+ //! even in mixed precision mode regardless of builder flags. To override this default, use this method
+ //! to set the desired compute precision.
+ //!
+ //! For a weakly typed network:
//!
- //! setPrecision() and setOutputPrecision() functions can still be called to control the input and output data types
- //! to this layer.
+ //! * Method setOutputType() can still be called to control the output data type.
+ //!
+ //! * Method setPrecision() can still be called. The input data is cast to that precision before
+ //! being cast to the compute precision.
+ //!
+    //! Neither of these two methods is allowed for a strongly typed network.
//!
     //! Only DataType::kFLOAT and DataType::kHALF are valid types for \p type.
//!
@@ -6831,6 +6148,7 @@ class INormalizationLayer : public ILayer
return mImpl->setComputePrecision(type);
}
+ //!
//! \brief Get the compute precision of this layer.
//!
//! \return The datatype used for the compute precision of this layer.
@@ -6851,10 +6169,8 @@ class INormalizationLayer : public ILayer
//! \brief A network definition for input to the builder.
//!
//! A network definition defines the structure of the network, and combined with a IBuilderConfig, is built
-//! into an engine using an IBuilder. An INetworkDefinition can either have an implicit batch dimensions, specified
-//! at runtime, or all dimensions explicit, full dims mode, in the network definition. The former mode, i.e. the
-//! implicit batch size mode, has been deprecated. The function hasImplicitBatchDimension() can be used to query the
-//! mode of the network.
+//! into an engine using an IBuilder. An INetworkDefinition has all dimensions explicit (full dims mode) in the
+//! network definition. The implicit batch size mode has been deprecated.
//!
//! A network with implicit batch dimensions returns the dimensions of a layer without the implicit dimension,
//! and instead the batch is specified at execute/enqueue time. If the network has all dimensions specified, then
@@ -6875,13 +6191,12 @@ class INetworkDefinition : public INoCopy
//! The name of the input tensor is used to find the index into the buffer array for an engine built from
//! the network. The volume must be less than 2^31 elements.
//!
- //! For networks with an implicit batch dimension, this volume includes the batch dimension with its length set
- //! to the maximum batch size. For networks with all explicit dimensions and with wildcard dimensions, the volume
+ //! For networks with wildcard dimensions, the volume
     //! is based on the maxima specified by an IOptimizationProfile. Dimensions are normally non-negative integers. The
//! exception is that in networks with all explicit dimensions, -1 can be used as a wildcard for a dimension to
//! be specified at runtime. Input tensors with such a wildcard must have a corresponding entry in the
//! IOptimizationProfiles indicating the permitted extrema, and the input dimensions must be set by
- //! IExecutionContext::setBindingDimensions. Different IExecutionContext instances can have different dimensions.
+ //! IExecutionContext::setInputShape. Different IExecutionContext instances can have different dimensions.
//! Wildcard dimensions are only supported for EngineCapability::kSTANDARD. They are not
//! supported in safety contexts. DLA does not support Wildcard dimensions.
//!
@@ -6906,7 +6221,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new tensor or nullptr if there is an error.
//!
- ITensor* addInput(char const* name, DataType type, Dims dimensions) noexcept
+ ITensor* addInput(char const* name, DataType type, Dims const& dimensions) noexcept
{
return mImpl->addInput(name, type, dimensions);
}
@@ -6926,50 +6241,47 @@ class INetworkDefinition : public INoCopy
}
//!
- //! \brief Add a convolution layer to the network.
- //!
- //! \param input The input tensor to the convolution.
- //! \param nbOutputMaps The number of output feature maps for the convolution.
- //! \param kernelSize The HW-dimensions of the convolution kernel.
- //! \param kernelWeights The kernel weights for the convolution.
- //! \param biasWeights The bias weights for the convolution. Weights{} represents no bias.
+ //! \brief Mark a tensor as a debug tensor.
//!
- //! \see IConvolutionLayer
+ //! A debug tensor can be optionally emitted at runtime.
+ //! Note that tensor names are required to specify debug
+ //! tensors at runtime.
//!
- //! \warning It is an error to specify a wildcard value for the 'C' dimension of the input tensor.
- //! \warning Int32 tensors are not valid input tensors.
+ //! \param tensor Tensor to be marked as debug
//!
- //! \return The new convolution layer, or nullptr if it could not be created.
+ //! \return True if tensor successfully marked (or was already marked), false otherwise.
//!
- //! \deprecated Superseded by addConvolutionNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
+ //! \see unmarkDebug(), IExecutionContext::setDebugListener(), ITensor::setName()
//!
- TRT_DEPRECATED IConvolutionLayer* addConvolution(
- ITensor& input, int32_t nbOutputMaps, DimsHW kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ bool markDebug(ITensor& tensor) noexcept
{
- return mImpl->addConvolution(input, nbOutputMaps, kernelSize, kernelWeights, biasWeights);
+ return mImpl->markDebug(tensor);
}
//!
- //! \brief Add a fully connected layer to the network.
+ //! \brief Unmark a tensor as a debug tensor.
//!
- //! \param input The input tensor to the layer.
- //! \param nbOutputs The number of outputs of the layer.
- //! \param kernelWeights The kernel weights for the fully connected layer.
- //! \param biasWeights The bias weights for the fully connected layer. Weights{} represents no bias.
+ //! Remove the marking of a tensor as a debug tensor.
//!
- //! \see IFullyConnectedLayer
+ //! \param tensor Tensor to be unmarked as debug.
//!
- //! \warning It is an error to specify a wildcard value for the 'C' dimension of the input tensor.
- //! \warning Int32 tensors are not valid input tensors.
+ //! \return True if tensor successfully unmarked (or was already unmarked), false otherwise.
+ //!
+ //! \see markDebug(), IExecutionContext::setDebugListener()
+ //!
+ bool unmarkDebug(ITensor& tensor) noexcept
+ {
+ return mImpl->unmarkDebug(tensor);
+ }
+
//!
- //! \return The new fully connected layer, or nullptr if it could not be created.
+ //! \brief Check if a tensor is marked as debug tensor.
//!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by addMatrixMultiply().
+ //! \return true if tensor is marked as debug tensor, false otherwise.
//!
- TRT_DEPRECATED IFullyConnectedLayer* addFullyConnected(
- ITensor& input, int32_t nbOutputs, Weights kernelWeights, Weights biasWeights) noexcept
+ bool isDebugTensor(nvinfer1::ITensor const& tensor) const noexcept
{
- return mImpl->addFullyConnected(input, nbOutputs, kernelWeights, biasWeights);
+ return mImpl->isDebugTensor(tensor);
}
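    // --- Editorial usage sketch (not part of the header): marking a debug tensor. ---
    // A minimal sketch assuming <NvInfer.h> is included and that `network` and `layer` already exist;
    // the tensor name "debug_out" is illustrative only.
    inline void markLayerOutputForDebug(nvinfer1::INetworkDefinition& network, nvinfer1::ILayer& layer)
    {
        nvinfer1::ITensor* t = layer.getOutput(0);
        t->setName("debug_out");                    // a name is required to select the debug tensor at runtime
        bool const marked = network.markDebug(*t);  // true if the tensor is (now) marked as a debug tensor
        if (marked && network.isDebugTensor(*t))
        {
            network.unmarkDebug(*t);                // the marking can be removed again before building
        }
    }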
//!
@@ -6982,7 +6294,8 @@ class INetworkDefinition : public INoCopy
//! output for activations that require these parameters.
//!
//! \see IActivationLayer ActivationType
- //! \warning Int32 tensors are not valid input tensors.
+ //!
+ //! \warning Int32 and Int64 are valid only for activation type kRELU.
//!
//! \return The new activation layer, or nullptr if it could not be created.
//!
@@ -6991,25 +6304,6 @@ class INetworkDefinition : public INoCopy
return mImpl->addActivation(input, type);
}
- //!
- //! \brief Add a pooling layer to the network.
- //!
- //! \param input The input tensor to the layer.
- //! \param type The type of pooling to apply.
- //! \param windowSize The size of the pooling window.
- //!
- //! \see IPoolingLayer PoolingType
- //! \warning Int32 tensors are not valid input tensors.
- //!
- //! \return The new pooling layer, or nullptr if it could not be created.
- //!
- //! \deprecated Superseded by addPoolingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED IPoolingLayer* addPooling(ITensor& input, PoolingType type, DimsHW windowSize) noexcept
- {
- return mImpl->addPooling(input, type, windowSize);
- }
-
//!
//! \brief Add a LRN layer to the network.
//!
@@ -7024,7 +6318,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new LRN layer, or nullptr if it could not be created.
//!
- ILRNLayer* addLRN(ITensor& input, int32_t window, float alpha, float beta, float k) noexcept
+ ILRNLayer* addLRN(ITensor& input, int64_t window, float alpha, float beta, float k) noexcept
{
return mImpl->addLRN(input, window, alpha, beta, k);
}
@@ -7033,8 +6327,7 @@ class INetworkDefinition : public INoCopy
//! \brief Add a Scale layer to the network.
//!
//! \param input The input tensor to the layer.
- //! This tensor is required to have a minimum of 3 dimensions in implicit batch mode
- //! and a minimum of 4 dimensions in explicit batch mode.
+ //! This tensor must have at least 4 dimensions.
//! \param mode The scaling mode.
//! \param shift The shift value.
//! \param scale The scale value.
@@ -7086,30 +6379,6 @@ class INetworkDefinition : public INoCopy
return mImpl->addConcatenation(inputs, nbInputs);
}
- //!
- //! \brief Add a deconvolution layer to the network.
- //!
- //! \param input The input tensor to the layer.
- //! \param nbOutputMaps The number of output feature maps.
- //! \param kernelSize The HW-dimensions of the deconvolution kernel.
- //! \param kernelWeights The kernel weights for the deconvolution.
- //! \param biasWeights The bias weights for the deconvolution. Weights{} represents no bias.
- //!
- //! \see IDeconvolutionLayer
- //!
- //! \warning It is an error to specify a wildcard value for the 'C' dimension of the input tensor.
- //! \warning Int32 tensors are not valid input tensors.
- //!
- //! \return The new deconvolution layer, or nullptr if it could not be created.
- //!
- //! \deprecated Superseded by addDeconvolutionNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED IDeconvolutionLayer* addDeconvolution(
- ITensor& input, int32_t nbOutputMaps, DimsHW kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
- {
- return mImpl->addDeconvolution(input, nbOutputMaps, kernelSize, kernelWeights, biasWeights);
- }
-
//!
//! \brief Add an elementwise layer to the network.
//!
@@ -7159,23 +6428,6 @@ class INetworkDefinition : public INoCopy
return mImpl->addUnary(input, operation);
}
- //! \brief Add a padding layer to the network.
- //!
- //! \param input The input tensor to the layer.
- //! \param prePadding The padding to apply to the start of the tensor.
- //! \param postPadding The padding to apply to the end of the tensor.
- //!
- //! \see IPaddingLayer
- //!
- //! \return The new padding layer, or nullptr if it could not be created.
- //!
- //! \deprecated Superseded by addPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED IPaddingLayer* addPadding(ITensor& input, DimsHW prePadding, DimsHW postPadding) noexcept
- {
- return mImpl->addPadding(input, prePadding, postPadding);
- }
-
//!
//! \brief Add a shuffle layer to the network.
//!
@@ -7195,7 +6447,7 @@ class INetworkDefinition : public INoCopy
//!
//! \param indices - tensor containing indices where on_value should be set.
//! \param values - a 2-element tensor, consisting of [off_value, on_value].
- //! \param depth - tensor containing the width of the added one-hot dimension.
+ //! \param depth - a shape tensor containing the width of the added one-hot dimension.
//! \param axis - the axis to add the one-hot encoding to.
//!
//! \see IOneHotLayer
@@ -7291,18 +6543,6 @@ class INetworkDefinition : public INoCopy
return mImpl->getOutput(index);
}
- //!
- //! \brief Destroy this INetworkDefinition object.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Add a reduce layer to the network.
//!
@@ -7312,7 +6552,6 @@ class INetworkDefinition : public INoCopy
     //! The bit in position i of bitmask reduceAxes corresponds to explicit dimension i of the result.
//! E.g., the least significant bit corresponds to the first explicit dimension and the next to least
//! significant bit corresponds to the second explicit dimension.
- //!
//! \param keepDimensions The boolean that specifies whether or not to keep the reduced dimensions in the
//! output of the layer.
//!
@@ -7321,7 +6560,7 @@ class INetworkDefinition : public INoCopy
//!
//! \see IReduceLayer
//!
- //! \warning If output is an Int32 shape tensor, ReduceOperation::kAVG is unsupported.
+ //! \warning If output is an Int32 or Int64 shape tensor, ReduceOperation::kAVG is unsupported.
//!
//! \return The new reduce layer, or nullptr if it could not be created.
//!
@@ -7356,8 +6595,6 @@ class INetworkDefinition : public INoCopy
//!
//! \see ITopKLayer
//!
- //! \warning Int32 tensors are not valid input tensors.
- //!
//! \return The new TopK layer, or nullptr if it could not be created.
//!
ITopKLayer* addTopK(ITensor& input, TopKOperation op, int32_t k, uint32_t reduceAxes) noexcept
@@ -7407,6 +6644,7 @@ class INetworkDefinition : public INoCopy
//!
//! \warning The bounds tensor cannot have the last dimension be the wildcard character.
//! \warning Int32 tensors are not valid input tensors.
+ //! \warning The input and bounds tensors should be 3D tensors.
//!
//! \return The new RaggedSoftMax layer, or nullptr if it could not be created.
//!
@@ -7465,90 +6703,16 @@ class INetworkDefinition : public INoCopy
     //! Otherwise the output is a tensor of real values and the output type will
     //! follow TensorRT's normal precision rules.
//!
- //! If tensors in the network have an implicit batch dimension, the constant
- //! is broadcast over that dimension.
- //!
//! If a wildcard dimension is used, the volume of the runtime dimensions must equal
//! the number of weights specified.
//!
//! \warning DataType::kUINT8 not supported.
//!
- IConstantLayer* addConstant(Dims dimensions, Weights weights) noexcept
+ IConstantLayer* addConstant(Dims const& dimensions, Weights weights) noexcept
{
return mImpl->addConstant(dimensions, weights);
}
- //!
- //! \brief Add an \p layerCount deep RNN layer to the network with \p hiddenSize internal states that can
- //! take a batch with fixed or variable sequence lengths.
- //!
- //! \param input The input tensor to the layer (see below).
- //! \param layerCount The number of layers in the RNN.
- //! \param hiddenSize Size of the internal hidden state for each layer.
- //! \param maxSeqLen Maximum sequence length for the input.
- //! \param op The type of RNN to execute.
- //!
- //! By default, the layer is configured with RNNDirection::kUNIDIRECTION and RNNInputMode::kLINEAR.
- //! To change these settings, use IRNNv2Layer::setDirection() and IRNNv2Layer::setInputMode().
- //!
- //! %Weights and biases for the added layer should be set using
- //! IRNNv2Layer::setWeightsForGate() and IRNNv2Layer::setBiasForGate() prior
- //! to building an engine using this network.
- //!
- //! The input tensors must be of the type DataType::kFLOAT or DataType::kHALF.
- //! The layout of the weights is row major and must be the same datatype as the input tensor.
- //! \p weights contain 8 matrices and \p bias contains 8 vectors.
- //!
- //! See IRNNv2Layer::setWeightsForGate() and IRNNv2Layer::setBiasForGate() for details on the required input
- //! format for \p weights and \p bias.
- //!
- //! The \p input ITensor should contain zero or more index dimensions `{N1, ..., Np}`, followed by
- //! two dimensions, defined as follows:
- //! - `S_max` is the maximum allowed sequence length (number of RNN iterations)
- //! - `E` specifies the embedding length (unless RNNInputMode::kSKIP is set, in which case it should match
- //! getHiddenSize()).
- //!
- //! By default, all sequences in the input are assumed to be size \p maxSeqLen. To provide explicit sequence
- //! lengths for each input sequence in the batch, use IRNNv2Layer::setSequenceLengths().
- //!
- //! The RNN layer outputs up to three tensors.
- //!
- //! The first output tensor is the output of the final RNN layer across all timesteps, with dimensions
- //! `{N1, ..., Np, S_max, H}`:
- //!
- //! - `N1..Np` are the index dimensions specified by the input tensor
- //! - `S_max` is the maximum allowed sequence length (number of RNN iterations)
- //! - `H` is an output hidden state (equal to getHiddenSize() or 2x getHiddenSize())
- //!
- //! The second tensor is the final hidden state of the RNN across all layers, and if the RNN
- //! is an LSTM (i.e. getOperation() is RNNOperation::kLSTM), then the third tensor is the final cell state
- //! of the RNN across all layers. Both the second and third output tensors have dimensions
- //! `{N1, ..., Np, L, H}`:
- //!
- //! - `N1..Np` are the index dimensions specified by the input tensor
- //! - `L` is the number of layers in the RNN, equal to getLayerCount() if getDirection is
- //! RNNDirection::kUNIDIRECTION,
- //! and 2x getLayerCount() if getDirection is RNNDirection::kBIDIRECTION. In the bi-directional
- //! case, layer `l`'s final forward hidden state is stored in `L = 2*l`, and
- //! final backward hidden state is stored in `L= 2*l + 1`.
- //! - `H` is the hidden state for each layer, equal to getHiddenSize().
- //!
- //! \see IRNNv2Layer
- //!
- //! \deprecated Deprecated prior to TensorRT 8.0 and will be removed in 9.0. Superseded by
- //! INetworkDefinition::addLoop().
- //!
- //! \warning RNN inputs do not support wildcard dimensions or explicit batch size networks.
- //! \warning Int32 tensors are not valid input tensors, only for sequence lengths.
- //!
- //! \return The new RNN layer, or nullptr if it could not be created.
- //!
- TRT_DEPRECATED IRNNv2Layer* addRNNv2(
- ITensor& input, int32_t layerCount, int32_t hiddenSize, int32_t maxSeqLen, RNNOperation op) noexcept
- {
- return mImpl->addRNNv2(input, layerCount, hiddenSize, maxSeqLen, op);
- }
-
//!
//! \brief Add an identity layer.
//!
@@ -7624,6 +6788,25 @@ class INetworkDefinition : public INoCopy
return mImpl->addPluginV2(inputs, nbInputs, plugin);
}
+ //!
+ //! \brief Add a plugin layer implementing the IPluginV3 interface to the network.
+ //!
+ //! \param inputs The input tensors to the layer.
+ //! \param nbInputs The number of input tensors.
+ //! \param shapeInputs Shape tensor inputs to the layer.
+ //! \param nbShapeInputs The number of shape tensor inputs.
+ //! \param plugin The layer plugin.
+ //!
+ //! \see IPluginV3Layer
+ //!
+ //! \return The new plugin layer, or nullptr if it could not be created.
+ //!
+ IPluginV3Layer* addPluginV3(ITensor* const* inputs, int32_t nbInputs, ITensor* const* shapeInputs,
+ int32_t nbShapeInputs, IPluginV3& plugin) noexcept
+ {
+ return mImpl->addPluginV3(inputs, nbInputs, shapeInputs, nbShapeInputs, plugin);
+ }
+
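// --- Editorial usage sketch (not part of the header): adding an IPluginV3 plugin layer. ---
// A hedged sketch; `plugin` is assumed to be an already-created IPluginV3 instance, and the data/shape
// input tensors are assumed to exist in `network`.
inline nvinfer1::IPluginV3Layer* addMyPluginV3(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& data,
    nvinfer1::ITensor& shape, nvinfer1::IPluginV3& plugin)
{
    nvinfer1::ITensor* inputs[] = {&data};       // regular (device) inputs
    nvinfer1::ITensor* shapeInputs[] = {&shape}; // shape tensor inputs; may be empty for many plugins
    return network.addPluginV3(inputs, 1, shapeInputs, 1, plugin);
}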
//!
//! \brief Add a slice layer to the network.
//!
@@ -7638,7 +6821,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new slice layer, or nullptr if it could not be created.
//!
- ISliceLayer* addSlice(ITensor& input, Dims start, Dims size, Dims stride) noexcept
+ ISliceLayer* addSlice(ITensor& input, Dims const& start, Dims const& size, Dims const& stride) noexcept
{
return mImpl->addSlice(input, start, size, stride);
}
@@ -7700,21 +6883,39 @@ class INetworkDefinition : public INoCopy
//!
//! \brief Query whether the network was created with an implicit batch dimension.
//!
- //! \return True if tensors have implicit batch dimension, false otherwise.
- //!
- //! This is a network-wide property. Either all tensors in the network
- //! have an implicit batch dimension or none of them do.
- //!
- //! hasImplicitBatchDimension() is true if and only if this INetworkDefinition
- //! was created with createNetworkV2() without NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
+ //! \return Always false since TensorRT 10.0 does not support an implicit batch dimension.
//!
//! \see createNetworkV2
//!
- bool hasImplicitBatchDimension() const noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch is not supported since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool hasImplicitBatchDimension() const noexcept
{
return mImpl->hasImplicitBatchDimension();
}
+ //!
+ //! \brief Get the network definition creation flags for this network definition object. Defaults to 0.
+ //!
+ //! \return The network definition creation options as a bitmask.
+ //!
+ NetworkDefinitionCreationFlags getFlags() const noexcept
+ {
+ return mImpl->getFlags();
+ }
+
+ //!
+ //! \brief Returns true if the network definition creation flag is set
+ //!
+ //! \see getFlags()
+ //!
+ //! \return True if flag is set, false if unset.
+ //!
+ bool getFlag(NetworkDefinitionCreationFlag networkDefinitionCreationFlag) const noexcept
+ {
+ return mImpl->getFlag(networkDefinitionCreationFlag);
+ }
+
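// --- Editorial usage sketch (not part of the header): querying network creation flags. ---
// A hedged sketch; it assumes the network was created via IBuilder::createNetworkV2 and that
// NetworkDefinitionCreationFlag::kSTRONGLY_TYPED is the flag of interest.
inline bool isStronglyTyped(nvinfer1::INetworkDefinition const& network)
{
    // getFlags() returns the full bitmask; getFlag() tests a single creation flag.
    return network.getFlag(nvinfer1::NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);
}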
//!
//! \brief Enable tensor's value to be computed by IExecutionContext::getShapeBinding.
//!
@@ -7726,7 +6927,6 @@ class INetworkDefinition : public INoCopy
//!
//! \warning It is an error to mark a network input as a shape output.
//!
- //! \see isShapeBinding(), getShapeBinding()
//!
bool markOutputForShapes(ITensor& tensor) noexcept
{
@@ -7781,7 +6981,7 @@ class INetworkDefinition : public INoCopy
//! \return The new convolution layer, or nullptr if it could not be created.
//!
IConvolutionLayer* addConvolutionNd(
- ITensor& input, int32_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ ITensor& input, int64_t nbOutputMaps, Dims const& kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
{
return mImpl->addConvolutionNd(input, nbOutputMaps, kernelSize, kernelWeights, biasWeights);
}
@@ -7800,7 +7000,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new pooling layer, or nullptr if it could not be created.
//!
- IPoolingLayer* addPoolingNd(ITensor& input, PoolingType type, Dims windowSize) noexcept
+ IPoolingLayer* addPoolingNd(ITensor& input, PoolingType type, Dims const& windowSize) noexcept
{
return mImpl->addPoolingNd(input, type, windowSize);
}
@@ -7823,7 +7023,7 @@ class INetworkDefinition : public INoCopy
//! \return The new deconvolution layer, or nullptr if it could not be created.
//!
IDeconvolutionLayer* addDeconvolutionNd(
- ITensor& input, int32_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ ITensor& input, int64_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
{
return mImpl->addDeconvolutionNd(input, nbOutputMaps, kernelSize, kernelWeights, biasWeights);
}
@@ -7865,6 +7065,7 @@ class INetworkDefinition : public INoCopy
return mImpl->addScaleNd(input, mode, shift, scale, power, channelAxis);
}
+ //!
//! \brief Add a resize layer to the network.
//!
//! \param input The input tensor to the layer.
@@ -7881,35 +7082,35 @@ class INetworkDefinition : public INoCopy
}
//!
- //! \brief True if network is an explicit precision network
+ //! \brief Add a loop to the network.
//!
- //! \deprecated Deprecated in TensorRT 8.0.
+ //! An ILoop provides a way to specify a recurrent subgraph.
//!
- //! \see createNetworkV2
+ //! \return Pointer to ILoop that can be used to add loop-boundary layers for the loop.
//!
- //! \return True if network has explicit precision, false otherwise.
+ //! \see ILoop
//!
- TRT_DEPRECATED bool hasExplicitPrecision() const noexcept
+ ILoop* addLoop() noexcept
{
- return mImpl->hasExplicitPrecision();
+ return mImpl->addLoop();
}
//!
- //! \brief Add a loop to the network.
+ //! \brief Add an if-then-else to the network.
//!
- //! An ILoop provides a way to specify a recurrent subgraph.
+ //! An IIfConditional provides a way to conditionally execute parts of the network.
//!
- //! \return Pointer to ILoop that can be used to add loop boundary layers for the loop,
- //! or nullptr if network has an implicit batch dimension or this version
- //! of TensorRT does not support loops.
+ //! \return Pointer to the IIfConditional that can be used to add conditional-boundary layers
+ //! for the if-then-else.
//!
- //! The network must not have an implicit batch dimension.
+ //! \see IIfConditional
//!
- ILoop* addLoop() noexcept
+ IIfConditional* addIfConditional() noexcept
{
- return mImpl->addLoop();
+ return mImpl->addIfConditional();
}
+ //!
//! \brief Add a select layer to the network.
//!
//! \param condition The condition tensor to the layer. Must have type DataType::kBOOL.
@@ -7938,8 +7139,6 @@ class INetworkDefinition : public INoCopy
//!
//! then the output dimensions are [1,3,0,9].
//!
- //! The network must not have an implicit batch dimension.
- //!
//! The inputs are shape tensors if the output is a shape tensor.
//!
//! \see ISelectLayer
@@ -7967,29 +7166,58 @@ class INetworkDefinition : public INoCopy
return mImpl->addAssertion(condition, message);
}
+ //!
//! \brief Add a fill layer to the network.
//!
- //! \param dimensions The output tensor dimensions.
+ //! \param dimensions The output tensor dimensions if input 0 is missing.
//! \param op The fill operation that the layer applies.
//!
- //! \warning For FillOperation::kLINSPACE, dimensions.nbDims must be 1.
+ //! \warning For FillOperation::kLINSPACE, dimensions.nbDims must be 1 for static start/delta. If delta is provided
+ //! as a 1D tensor, the length of delta must match dimensions.nbDims.
//!
//! This layer is non-deterministic across subsequent calls as the same inputs will produce different
//! output tensors if \p op is either FillOperation::kRANDOM_UNIFORM or FillOperation::kRANDOM_NORMAL
     //! due to random state being shared across calls. The output tensors generated are deterministic when
//! starting from the same initial state.
//!
- //! The network must not have an implicit batch dimension.
- //!
//! \see IFillLayer
//!
//! \return The new fill layer, or nullptr if it could not be created.
//!
- IFillLayer* addFill(Dims dimensions, FillOperation op) noexcept
+ //! \deprecated Deprecated in TensorRT 9.0. Superseded by three-argument addFill.
+ //!
+ TRT_DEPRECATED IFillLayer* addFill(Dims const& dimensions, FillOperation op) noexcept
{
return mImpl->addFill(dimensions, op);
}
+ //!
+ //! \brief Add a fill layer to the network.
+ //!
+ //! \param dimensions The output tensor dimensions if input 0 is missing.
+ //! \param op The fill operation that the layer applies.
+ //! \param outputType Optional output tensor data type, must be DataType::kFLOAT, DataType::kHALF, DataType::kINT32,
+ //! or DataType::kINT64. This parameter is only used for static alpha/beta. Future calls to set output type using
+ //! setToType or setOutputType must be consistent.
+ //!
+ //! \warning For FillOperation::kLINSPACE, dimensions.nbDims must be 1 for static start/delta. If delta is provided
+ //! as a 1D tensor, the length of delta must match dimensions.nbDims.
+ //!
+ //! This layer is non-deterministic across subsequent calls as the same inputs will produce different
+ //! output tensors if \p op is either FillOperation::kRANDOM_UNIFORM or FillOperation::kRANDOM_NORMAL
+ //! due to random state being shared across calls. The output tensors generated are deterministic when
+ //! starting from the same initial state.
+ //!
+ //! \see IFillLayer
+ //!
+ //! \return The new fill layer, or nullptr if it could not be created.
+ //!
+ IFillLayer* addFill(Dims const& dimensions, FillOperation op, DataType outputType) noexcept
+ {
+ return mImpl->addFillV2(dimensions, op, outputType);
+ }
+
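// --- Editorial usage sketch (not part of the header): a static linspace fill with an explicit output type. ---
// A hedged sketch; the dimensions and the alpha/beta values below are illustrative only.
inline nvinfer1::IFillLayer* addLinspaceFill(nvinfer1::INetworkDefinition& network)
{
    nvinfer1::Dims dims{};
    dims.nbDims = 1;
    dims.d[0] = 8; // produce 8 values
    auto* fill = network.addFill(dims, nvinfer1::FillOperation::kLINSPACE, nvinfer1::DataType::kFLOAT);
    if (fill != nullptr)
    {
        fill->setAlpha(0.0); // start value
        fill->setBeta(1.0);  // delta between consecutive values
    }
    return fill;
}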
+ //!
//! \brief Add a padding layer to the network. Only 2D padding is currently supported.
//!
//! \param input The input tensor to the layer.
@@ -8000,13 +7228,12 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new padding layer, or nullptr if it could not be created.
//!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by addSlice().
- //!
- TRT_DEPRECATED IPaddingLayer* addPaddingNd(ITensor& input, Dims prePadding, Dims postPadding) noexcept
+ IPaddingLayer* addPaddingNd(ITensor& input, Dims const& prePadding, Dims const& postPadding) noexcept
{
return mImpl->addPaddingNd(input, prePadding, postPadding);
}
+ //!
//! \brief Associate a name with all current uses of the given weights.
//!
//! The name must be set after the Weights are used in the network.
@@ -8072,17 +7299,40 @@ class INetworkDefinition : public INoCopy
//!
//! \see IDequantizeLayer
//!
- //! \p input tensor data type must be DataType::kFLOAT.
+ //! \p input tensor data type must be DataType::kINT8/DataType::kFP8.
//! \p scale tensor data type must be DataType::kFLOAT. The subgraph which terminates with the \p scale tensor must
//! be a build-time constant.
//!
     //! \return The new dequantization layer, or nullptr if it could not be created.
//!
- IDequantizeLayer* addDequantize(ITensor& input, ITensor& scale) noexcept
+ //! \deprecated Deprecated in TensorRT 9.0. Superseded by three-argument addDequantize.
+ //!
+ TRT_DEPRECATED IDequantizeLayer* addDequantize(ITensor& input, ITensor& scale) noexcept
{
return mImpl->addDequantize(input, scale);
}
+ //!
+ //! \brief Add a dequantization layer to the network.
+ //!
+ //! \param input The input tensor to be dequantized.
+ //! \param scale A tensor with the scale value.
+ //!
+ //! \see IDequantizeLayer
+ //!
+ //! \p input tensor data type must be DataType::kINT8/DataType::kFP8/DataType::kINT4.
+ //! \p scale tensor data type defaults to DataType::kFLOAT. For strongly typed networks, it must be the same as the
+ //! output data type. The subgraph which terminates with the \p scale tensor must be a build-time constant.
+ //! \p outputType output tensor data type, default value is DataType::kFLOAT. Future calls to set output type using
+ //! setToType or setOutputType must be consistent. For strongly typed networks, it must be the same as the scale data type.
+ //!
+    //! \return The new dequantization layer, or nullptr if it could not be created.
+ //!
+ IDequantizeLayer* addDequantize(ITensor& input, ITensor& scale, DataType outputType) noexcept
+ {
+ return mImpl->addDequantizeV2(input, scale, outputType);
+ }
+
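// --- Editorial usage sketch (not part of the header): explicit dequantization with an output type. ---
// A hedged sketch; `quantized` is assumed to be a DataType::kINT8 tensor and `scale` a build-time
// constant DataType::kFLOAT tensor already present in the network.
inline nvinfer1::IDequantizeLayer* addExplicitDequantize(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& quantized, nvinfer1::ITensor& scale)
{
    // The third argument selects the dequantized output type.
    return network.addDequantize(quantized, scale, nvinfer1::DataType::kFLOAT);
}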
//!
//! \brief Add a Scatter layer to the network with specified mode and axis=0.
//!
@@ -8111,32 +7361,41 @@ class INetworkDefinition : public INoCopy
//!
//! \see IQuantizeLayer
//!
- //! \p input tensor data type must be DataType::kFLOAT.
+ //! \p input tensor data type must be DataType::kFLOAT/DataType::kHALF.
//! \p scale tensor data type must be DataType::kFLOAT. The subgraph which terminates with the \p scale tensor must
//! be a build-time constant.
//!
//! \return The new quantization layer, or nullptr if it could not be created.
//!
- IQuantizeLayer* addQuantize(ITensor& input, ITensor& scale) noexcept
+ //! \deprecated Deprecated in TensorRT 9.0. Superseded by three-argument addQuantize.
+ //!
+ TRT_DEPRECATED IQuantizeLayer* addQuantize(ITensor& input, ITensor& scale) noexcept
{
return mImpl->addQuantize(input, scale);
}
//!
- //! \brief Add an If-conditional layer to the network.
+ //! \brief Add a quantization layer to the network.
//!
- //! An IIfConditional provides a way to conditionally execute parts of the network.
+ //! \param input The input tensor to be quantized.
+ //! \param scale A tensor with the scale value.
//!
- //! \see IIfConditional
+ //! \see IQuantizeLayer
//!
- //! \return The new conditional layer, or nullptr if network has an implicit batch dimension
- //! or this version of TensorRT does not support conditional execution.
+ //! \p input tensor data type must be DataType::kFLOAT/DataType::kHALF/DataType::kBF16.
+ //! \p scale tensor data type defaults to DataType::kFLOAT. For strongly typed networks, it must have the same data
+ //! type as the input. The subgraph which terminates with the \p scale tensor must be a build-time constant.
+ //! \p outputType output tensor data type, must be DataType::kINT8 (default), DataType::kFP8 or DataType::kINT4.
+ //! Future calls to set output type using setToType or setOutputType must be consistent.
//!
- IIfConditional* addIfConditional() noexcept
+ //! \return The new quantization layer, or nullptr if it could not be created.
+ //!
+ IQuantizeLayer* addQuantize(ITensor& input, ITensor& scale, DataType outputType) noexcept
{
- return mImpl->addIfConditional();
+ return mImpl->addQuantizeV2(input, scale, outputType);
}
+ //!
//! \brief Add an Einsum layer to the network.
//!
//! \param inputs The input tensors to the layer.
@@ -8151,10 +7410,12 @@ class INetworkDefinition : public INoCopy
return mImpl->addEinsum(inputs, nbInputs, equation);
}
+ //!
//! \brief Add a GridSample layer to the network.
//!
//! \param input The input tensor to the layer.
//! \param grid The grid tensor to the layer.
+ //!
//! \see IGridSampleLayer
//!
//! Creates a GridSample layer with a InterpolationMode::kLINEAR, unaligned corners,
@@ -8223,8 +7484,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new normalization layer, or nullptr if it could not be created.
//!
- INormalizationLayer* addNormalization(
- ITensor& input, ITensor& scale, ITensor& bias, uint32_t axesMask) noexcept
+ INormalizationLayer* addNormalization(ITensor& input, ITensor& scale, ITensor& bias, uint32_t axesMask) noexcept
{
return mImpl->addNormalization(input, scale, bias, axesMask);
}
@@ -8245,16 +7505,16 @@ class INetworkDefinition : public INoCopy
};
//!
-//! enum CalibrationAlgoType
+//! \enum CalibrationAlgoType
//!
//! \brief Version of calibration algorithm to use.
//!
enum class CalibrationAlgoType : int32_t
{
- kLEGACY_CALIBRATION = 0,
- kENTROPY_CALIBRATION = 1,
- kENTROPY_CALIBRATION_2 = 2,
- kMINMAX_CALIBRATION = 3,
+ kLEGACY_CALIBRATION = 0, //!< Legacy calibration
+ kENTROPY_CALIBRATION = 1, //!< Legacy entropy calibration
+ kENTROPY_CALIBRATION_2 = 2, //!< Entropy calibration
+ kMINMAX_CALIBRATION = 3, //!< Minmax calibration
};
//!
@@ -8279,7 +7539,7 @@ constexpr inline int32_t EnumMax() noexcept
//! the distribution of activations. It may optionally implement a method for caching the calibration result for reuse
//! on subsequent runs.
//!
-class IInt8Calibrator
+class IInt8Calibrator : public IVersionedInterface
{
public:
//!
@@ -8287,7 +7547,9 @@ class IInt8Calibrator
//!
//! \return The batch size.
//!
- virtual int32_t getBatchSize() const noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED virtual int32_t getBatchSize() const noexcept = 0;
//!
//! \brief Get a batch of input for calibration.
@@ -8298,6 +7560,7 @@ class IInt8Calibrator
//! containing each network input data.
//! \param names The names of the network input for each pointer in the binding array.
//! \param nbBindings The number of pointers in the bindings array.
+ //!
//! \return False if there are no more batches for calibration.
//!
//! \see getBatchSize()
@@ -8337,16 +7600,22 @@ class IInt8Calibrator
//!
virtual CalibrationAlgoType getAlgorithm() noexcept = 0;
- virtual ~IInt8Calibrator() noexcept = default;
+ ~IInt8Calibrator() noexcept override = default;
};
-//!
-//! Entropy calibrator. This is the Legacy Entropy calibrator. It is less complicated than the legacy calibrator and
-//! produces better results.
-//!
+namespace v_1_0
+{
class IInt8EntropyCalibrator : public IInt8Calibrator
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IInt8EntropyCalibrator", 1, 0};
+ }
+
//!
//! Signal that this is the entropy calibrator.
//!
@@ -8355,16 +7624,36 @@ class IInt8EntropyCalibrator : public IInt8Calibrator
return CalibrationAlgoType::kENTROPY_CALIBRATION;
}
- virtual ~IInt8EntropyCalibrator() noexcept = default;
+ ~IInt8EntropyCalibrator() noexcept override = default;
};
+} // namespace v_1_0
//!
-//! Entropy calibrator 2. This is the preferred calibrator. This is the required calibrator for DLA, as it supports per
-//! activation tensor scaling.
+//! \class IInt8EntropyCalibrator
+//!
+//! \brief Entropy calibrator.
+//!
+//! This is the Legacy Entropy calibrator. It is less complicated than the legacy calibrator and
+//! produces better results.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IInt8EntropyCalibrator, not
+//! v_1_0::IInt8EntropyCalibrator.
//!
+using IInt8EntropyCalibrator = v_1_0::IInt8EntropyCalibrator;
+
+namespace v_1_0
+{
class IInt8EntropyCalibrator2 : public IInt8Calibrator
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IInt8EntropyCalibrator2", 1, 0};
+ }
+
//!
//! Signal that this is the entropy calibrator 2.
//!
@@ -8373,15 +7662,36 @@ class IInt8EntropyCalibrator2 : public IInt8Calibrator
return CalibrationAlgoType::kENTROPY_CALIBRATION_2;
}
- virtual ~IInt8EntropyCalibrator2() noexcept = default;
+ ~IInt8EntropyCalibrator2() noexcept override = default;
};
+} // namespace v_1_0
//!
-//! MinMax Calibrator. It supports per activation tensor scaling.
+//! \class IInt8EntropyCalibrator2
+//!
+//! \brief Entropy calibrator 2.
+//!
+//! This is the preferred calibrator. This is the required calibrator for DLA, as it supports per
+//! activation tensor scaling.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IInt8EntropyCalibrator2, not
+//! v_1_0::IInt8EntropyCalibrator2.
//!
+using IInt8EntropyCalibrator2 = v_1_0::IInt8EntropyCalibrator2;
+
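// --- Editorial sketch (not part of the header): a minimal IInt8EntropyCalibrator2 skeleton. ---
// A hedged outline only; a real calibrator must copy calibration batches to device memory and
// usually caches the calibration table. All names below are illustrative.
class MyEntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    int32_t getBatchSize() const noexcept override { return 1; }
    bool getBatch(void* bindings[], char const* names[], int32_t nbBindings) noexcept override
    {
        return false; // no more batches; a real implementation fills `bindings` with device pointers
    }
    void const* readCalibrationCache(std::size_t& length) noexcept override
    {
        length = 0;
        return nullptr; // returning nullptr forces calibration instead of using a cached table
    }
    void writeCalibrationCache(void const* ptr, std::size_t length) noexcept override {}
};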
+namespace v_1_0
+{
class IInt8MinMaxCalibrator : public IInt8Calibrator
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IInt8MinMaxCalibrator", 1, 0};
+ }
+
//!
//! Signal that this is the MinMax Calibrator.
//!
@@ -8390,16 +7700,35 @@ class IInt8MinMaxCalibrator : public IInt8Calibrator
return CalibrationAlgoType::kMINMAX_CALIBRATION;
}
- virtual ~IInt8MinMaxCalibrator() noexcept = default;
+ ~IInt8MinMaxCalibrator() noexcept override = default;
};
+} // namespace v_1_0
//!
-//! Legacy calibrator left for backward compatibility with TensorRT 2.0. This calibrator requires user parameterization,
-//! and is provided as a fallback option if the other calibrators yield poor results.
+//! \class IInt8MinMaxCalibrator
+//!
+//! \brief MinMax Calibrator.
+//!
+//! It supports per activation tensor scaling.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IInt8MinMaxCalibrator, not
+//! v_1_0::IInt8MinMaxCalibrator.
//!
+using IInt8MinMaxCalibrator = v_1_0::IInt8MinMaxCalibrator;
+
+namespace v_1_0
+{
class IInt8LegacyCalibrator : public IInt8Calibrator
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+        return InterfaceInfo{"IInt8LegacyCalibrator", 1, 0};
+ }
+
//!
//! Signal that this is the legacy calibrator.
//!
@@ -8448,8 +7777,22 @@ class IInt8LegacyCalibrator : public IInt8Calibrator
//!
virtual void writeHistogramCache(void const* ptr, std::size_t length) noexcept = 0;
- virtual ~IInt8LegacyCalibrator() noexcept = default;
+ ~IInt8LegacyCalibrator() noexcept override = default;
};
+} // namespace v_1_0
+
+//!
+//! \class IInt8LegacyCalibrator
+//!
+//! \brief Legacy calibrator.
+//!
+//! This calibrator requires user parameterization,
+//! and is provided as a fallback option if the other calibrators yield poor results.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IInt8LegacyCalibrator, not
+//! v_1_0::IInt8LegacyCalibrator.
+//!
+using IInt8LegacyCalibrator = v_1_0::IInt8LegacyCalibrator;
//!
//! \class IAlgorithmIOInfo
@@ -8464,19 +7807,6 @@ class IInt8LegacyCalibrator : public IInt8Calibrator
class IAlgorithmIOInfo : public INoCopy
{
public:
- //!
- //! \brief Return TensorFormat of the input/output of algorithm.
- //!
- //! \deprecated Deprecated in TensorRT 8.6. The strides, data type, and vectorization
- //! information is sufficient to uniquely identify tensor formats.
- //!
- //! \return the tensor format
- //!
- TRT_DEPRECATED TensorFormat getTensorFormat() const noexcept
- {
- return mImpl->getTensorFormat();
- }
-
//!
//! \brief Return DataType of the input/output of algorithm.
//!
@@ -8572,6 +7902,7 @@ class IAlgorithmContext : public INoCopy
public:
//!
//! \brief Return name of the algorithm node.
+ //!
//! This is a unique identifier for the IAlgorithmContext.
//!
char const* getName() const noexcept
@@ -8581,6 +7912,7 @@ class IAlgorithmContext : public INoCopy
//!
//! \brief Get the minimum / optimum / maximum dimensions for input or output tensor.
+ //!
//! \param index Index of the input or output of the algorithm. Incremental numbers assigned to indices of inputs
//! and the outputs.
//! \param select Which of the minimum, optimum, or maximum dimensions to be queried.
@@ -8613,9 +7945,11 @@ class IAlgorithmContext : public INoCopy
//!
//! \class IAlgorithm
+//!
//! \brief Describes a variation of execution of a layer.
//! An algorithm is represented by IAlgorithmVariant and the IAlgorithmIOInfo for each of its inputs and outputs.
-//! An algorithm can be selected or reproduced using AlgorithmSelector::selectAlgorithms()."
+//! An algorithm can be selected or reproduced using AlgorithmSelector::selectAlgorithms().
+//!
//! \see IAlgorithmIOInfo, IAlgorithmVariant, IAlgorithmSelector::selectAlgorithms()
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
@@ -8623,21 +7957,6 @@ class IAlgorithmContext : public INoCopy
class IAlgorithm : public INoCopy
{
public:
- //!
- //! \brief Returns the format of an Algorithm input or output. Algorithm inputs are incrementally numbered first,
- //! followed by algorithm outputs.
- //! \param index Index of the input or output of the algorithm. Incremental numbers assigned to indices of inputs
- //! and the outputs.
- //!
- //! \return a reference to IAlgorithmIOInfo specified by index or the first algorithm if index is out of range.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by IAlgorithm::getAlgorithmIOInfoByIndex().
- //!
- TRT_DEPRECATED IAlgorithmIOInfo const& getAlgorithmIOInfo(int32_t index) const noexcept
- {
- return mImpl->getAlgorithmIOInfo(index);
- }
-
//!
//! \brief Returns the algorithm variant.
//!
@@ -8665,6 +7984,7 @@ class IAlgorithm : public INoCopy
//!
//! \brief Returns the format of an Algorithm input or output. Algorithm inputs are incrementally numbered first,
//! followed by algorithm outputs.
+ //!
//! \param index Index of the input or output of the algorithm. Incremental numbers assigned to indices of inputs
//! and the outputs.
//!
@@ -8680,17 +8000,18 @@ class IAlgorithm : public INoCopy
apiv::VAlgorithm* mImpl;
}; // IAlgorithm
-//!
-//! \class IAlgorithmSelector
-//!
-//! \brief Interface implemented by application for selecting and reporting algorithms of a layer provided by the
-//! builder.
-//! \note A layer in context of algorithm selection may be different from ILayer in INetworkDefiniton.
-//! For example, an algorithm might be implementing a conglomeration of multiple ILayers in INetworkDefinition.
-//!
-class IAlgorithmSelector
+namespace v_1_0
+{
+class IAlgorithmSelector : public IVersionedInterface
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IAlgorithmSelector", 1, 0};
+ }
//!
//! \brief Select Algorithms for a layer from the given list of algorithm choices.
//!
@@ -8702,11 +8023,12 @@ class IAlgorithmSelector
//!
//! \note TensorRT uses its default algorithm selection to choose from the list provided.
//! If return value is 0, TensorRT's default algorithm selection is used unless
- //! BuilderFlag::kREJECT_EMPTY_ALGORITHMS (or the deprecated BuilderFlag::kSTRICT_TYPES) is set.
+ //! BuilderFlag::kREJECT_EMPTY_ALGORITHMS is set.
//! The list of choices is valid only for this specific algorithm context.
//!
virtual int32_t selectAlgorithms(IAlgorithmContext const& context, IAlgorithm const* const* choices,
int32_t nbChoices, int32_t* selection) noexcept = 0;
+
//!
//! \brief Called by TensorRT to report choices it made.
//!
@@ -8722,6 +8044,19 @@ class IAlgorithmSelector
virtual ~IAlgorithmSelector() noexcept = default;
};
+} // namespace v_1_0
+
+//!
+//! \class IAlgorithmSelector
+//!
+//! \brief Interface implemented by application for selecting and reporting algorithms of a layer provided by the
+//! builder.
+//! \note A layer in the context of algorithm selection may be different from ILayer in INetworkDefinition.
+//! For example, an algorithm might be implementing a conglomeration of multiple ILayers in INetworkDefinition.
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IAlgorithmSelector, not
+//! v_1_0::IAlgorithmSelector
+//!
+using IAlgorithmSelector = v_1_0::IAlgorithmSelector;
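// --- Editorial sketch (not part of the header): an IAlgorithmSelector that defers to TensorRT. ---
// A hedged outline; returning 0 from selectAlgorithms lets TensorRT use its default selection
// (unless BuilderFlag::kREJECT_EMPTY_ALGORITHMS is set, as noted above).
class DefaultSelector : public nvinfer1::IAlgorithmSelector
{
public:
    int32_t selectAlgorithms(nvinfer1::IAlgorithmContext const& context, nvinfer1::IAlgorithm const* const* choices,
        int32_t nbChoices, int32_t* selection) noexcept override
    {
        return 0; // no explicit choice; TensorRT falls back to its own algorithm selection
    }
    void reportAlgorithms(nvinfer1::IAlgorithmContext const* const* contexts,
        nvinfer1::IAlgorithm const* const* choices, int32_t nbAlgorithms) noexcept override
    {
        // A real selector could log the reported choices here; left empty in this sketch.
    }
};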
//!
//! \brief Represents one or more QuantizationFlag values using binary OR
@@ -8774,34 +8109,31 @@ using BuilderFlags = uint32_t;
//!
enum class BuilderFlag : int32_t
{
- kFP16 = 0, //!< Enable FP16 layer selection, with FP32 fallback.
- kINT8 = 1, //!< Enable Int8 layer selection, with FP32 fallback with FP16 fallback if kFP16 also specified.
- kDEBUG = 2, //!< Enable debugging of layers via synchronizing after every layer.
- kGPU_FALLBACK = 3, //!< Enable layers marked to execute on GPU if layer cannot execute on DLA.
+ //! Enable FP16 layer selection, with FP32 fallback.
+ kFP16 = 0,
- //! Legacy flag with effect similar to setting all of these three flags:
- //!
- //! * kPREFER_PRECISION_CONSTRAINTS
- //! * kDIRECT_IO
- //! * kREJECT_EMPTY_ALGORITHMS
- //!
- //! except that if the direct I/O requirement cannot be met and kDIRECT_IO was not explicitly set,
- //! instead of the build failing, the build falls back as if kDIRECT_IO was not set.
- //!
- //! \deprecated Deprecated in TensorRT 8.2.
- //!
- kSTRICT_TYPES TRT_DEPRECATED_ENUM = 4,
+ //! Enable Int8 layer selection, with FP32 fallback with FP16 fallback if kFP16 also specified.
+ kINT8 = 1,
- kREFIT = 5, //!< Enable building a refittable engine.
- kDISABLE_TIMING_CACHE = 6, //!< Disable reuse of timing information across identical layers.
+ //! Enable debugging of layers via synchronizing after every layer.
+ kDEBUG = 2,
+
+ //! Enable layers marked to execute on GPU if layer cannot execute on DLA.
+ kGPU_FALLBACK = 3,
+
+ //! Enable building a refittable engine.
+ kREFIT = 4,
+
+ //! Disable reuse of timing information across identical layers.
+ kDISABLE_TIMING_CACHE = 5,
//! Allow (but not require) computations on tensors of type DataType::kFLOAT to use TF32.
//! TF32 computes inner products by rounding the inputs to 10-bit mantissas before
//! multiplying, but accumulates the sum using 23-bit mantissas. Enabled by default.
- kTF32 = 7,
+ kTF32 = 6,
//! Allow the builder to examine weights and use optimized functions when weights have suitable sparsity.
- kSPARSE_WEIGHTS = 8,
+ kSPARSE_WEIGHTS = 7,
//! Change the allowed parameters in the EngineCapability::kSTANDARD flow to
//! match the restrictions that EngineCapability::kSAFETY check against for DeviceType::kGPU
@@ -8809,52 +8141,97 @@ enum class BuilderFlag : int32_t
//! is forced to true if EngineCapability::kSAFETY at build time if it is unset.
//!
//! This flag is only supported in NVIDIA Drive(R) products.
- kSAFETY_SCOPE = 9,
+ kSAFETY_SCOPE = 8,
//! Require that layers execute in specified precisions. Build fails otherwise.
- kOBEY_PRECISION_CONSTRAINTS = 10,
+ kOBEY_PRECISION_CONSTRAINTS = 9,
//! Prefer that layers execute in specified precisions.
//! Fall back (with warning) to another precision if build would otherwise fail.
- kPREFER_PRECISION_CONSTRAINTS = 11,
+ kPREFER_PRECISION_CONSTRAINTS = 10,
//! Require that no reformats be inserted between a layer and a network I/O tensor
//! for which ITensor::setAllowedFormats was called.
//! Build fails if a reformat is required for functional correctness.
- kDIRECT_IO = 12,
+ kDIRECT_IO = 11,
//! Fail if IAlgorithmSelector::selectAlgorithms returns an empty set of algorithms.
- kREJECT_EMPTY_ALGORITHMS = 13,
-
- //! Enable heuristic-based tactic selection for shorter engine generation time. The engine may not
- //! be as performant as when built with a profiling-based builder.
- //!
- //! This flag is only supported by NVIDIA Ampere and later GPUs.
- //! \deprecated Superseded by builder optimization level 2. Deprecated in TensorRT 8.6
- kENABLE_TACTIC_HEURISTIC = 14,
+ kREJECT_EMPTY_ALGORITHMS = 12,
//! Restrict to lean runtime operators to provide version forward compatibility
//! for the plan.
//!
- //! Using this flag with ICudaEngine::serialize() and BuilderFlag::kREFIT would result in error.
//! This flag is only supported by NVIDIA Volta and later GPUs.
//! This flag is not supported in NVIDIA Drive(R) products.
- //! This flag is not supported with implicit batch mode. Network must be created with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH.
- kVERSION_COMPATIBLE = 15,
+ kVERSION_COMPATIBLE = 13,
//! Exclude lean runtime from the plan when version forward compatability is enabled.
//! By default, this flag is unset, so the lean runtime will be included in the plan.
//!
//! If BuilderFlag::kVERSION_COMPATIBLE is not set then the value of this flag will be ignored.
- //!
- //! This flag is not supported with implicit batch mode. Network must be created with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH.
- kEXCLUDE_LEAN_RUNTIME = 16,
+ kEXCLUDE_LEAN_RUNTIME = 14,
//! Enable FP8 layer selection, with FP32 fallback.
- //! \warning kFP8 is not supported yet and will result in an error or undefined behavior.
- kFP8 = 17
+ //!
+ //! This flag is not supported with hardware-compatibility mode.
+ //!
+ //! \see HardwareCompatibilityLevel
+ kFP8 = 15,
+
+ //! Emit error when a tactic being timed is not present in the timing cache.
+ //! This flag has an effect only when IBuilderConfig has an associated ITimingCache.
+ kERROR_ON_TIMING_CACHE_MISS = 16,
+
+ //! Enable DataType::kBF16 layer selection, with FP32 fallback.
+ //! This flag is only supported by NVIDIA Ampere and later GPUs.
+ kBF16 = 17,
+
+ //! Disable caching of JIT-compilation results during engine build.
+ //! By default, JIT-compiled code will be serialized as part of the timing cache, which may significantly increase
+ //! the cache size. Setting this flag prevents the code from being serialized. This flag has an effect only when
+    //! BuilderFlag::kDISABLE_TIMING_CACHE is not set.
+ kDISABLE_COMPILATION_CACHE = 18,
+
+ //! Strip the refittable weights from the engine plan file.
+ kSTRIP_PLAN = 19,
+
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by kSTRIP_PLAN.
+ kWEIGHTLESS TRT_DEPRECATED_ENUM = kSTRIP_PLAN,
+
+ //! Create a refittable engine under the assumption that the refit weights will be identical to those provided at
+ //! build time. The resulting engine will have the same performance as a non-refittable one. All refittable weights
+ //! can be refitted through the refit API, but if the refit weights are not identical to the build-time weights,
+ //! behavior is undefined. When used alongside 'kSTRIP_PLAN', this flag will result in a small plan file for which
+ //! weights are later supplied via refitting. This enables use of a single set of weights with different inference
+ //! backends, or with TensorRT plans for multiple GPU architectures.
+ kREFIT_IDENTICAL = 20,
+
+ //!
+ //! \brief Enable weight streaming for the current engine.
+ //!
+ //! Weight streaming from the host enables execution of models that do not fit
+ //! in GPU memory by allowing TensorRT to intelligently stream network weights
+ //! from the CPU DRAM. Please see ICudaEngine::getMinimumWeightStreamingBudget
+ //! for the default memory budget when this flag is enabled.
+ //!
+ //! Enabling this feature changes the behavior of
+ //! IRuntime::deserializeCudaEngine to allocate the entire network’s weights
+ //! on the CPU DRAM instead of GPU memory. Then,
+ //! ICudaEngine::createExecutionContext will determine the optimal split of
+ //! weights between the CPU and GPU and place weights accordingly.
+ //!
+ //! Future TensorRT versions may enable this flag by default.
+ //!
+ //! \warning Enabling this flag may marginally increase build time.
+ //!
+ //! \warning Enabling this feature will significantly increase the latency of
+ //! ICudaEngine::createExecutionContext.
+ //!
+ //! \see IRuntime::deserializeCudaEngine,
+ //! ICudaEngine::getMinimumWeightStreamingBudget,
+ //! ICudaEngine::setWeightStreamingBudget
+ //!
+ kWEIGHT_STREAMING = 21,
};
//!
@@ -8865,7 +8242,7 @@ enum class BuilderFlag : int32_t
template <>
constexpr inline int32_t EnumMax() noexcept
{
- return 18;
+ return 22;
}
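// --- Editorial usage sketch (not part of the header): enabling a few builder flags. ---
// A hedged sketch; `config` is assumed to come from IBuilder::createBuilderConfig(), and the
// chosen flags are illustrative only.
inline void configureBuilderFlags(nvinfer1::IBuilderConfig& config)
{
    config.setFlag(nvinfer1::BuilderFlag::kFP16);       // allow FP16 kernels with FP32 fallback
    config.setFlag(nvinfer1::BuilderFlag::kSTRIP_PLAN); // strip refittable weights from the plan
    config.setFlag(nvinfer1::BuilderFlag::kREFIT_IDENTICAL);
}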
//!
@@ -8946,7 +8323,6 @@ enum class MemoryPoolType : int32_t
{
//!
//! kWORKSPACE is used by TensorRT to store intermediate buffers within an operation.
- //! This is equivalent to the deprecated IBuilderConfig::setMaxWorkspaceSize and overrides that value.
//! This defaults to max device memory. Set to a smaller value to restrict tactics that use over the
//! threshold en masse. For more targeted removal of tactics use the IAlgorithmSelector
//! interface.
@@ -8957,7 +8333,7 @@ enum class MemoryPoolType : int32_t
//! kDLA_MANAGED_SRAM is a fast software managed RAM used by DLA to communicate within a layer.
//! The size of this pool must be at least 4 KiB and must be a power of 2.
//! This defaults to 1 MiB.
- //! Orin has capacity of 1 MiB per core, and Xavier shares 4 MiB across all of its accelerator cores.
+ //! Orin has capacity of 1 MiB per core.
//!
kDLA_MANAGED_SRAM = 1,
@@ -8983,6 +8359,17 @@ enum class MemoryPoolType : int32_t
//! cudaGetDeviceProperties.embedded is true, and 100% otherwise.
//!
kTACTIC_DRAM = 4,
+
+ //!
+ //! kTACTIC_SHARED_MEMORY defines the maximum shared memory size utilized for executing
+ //! the backend CUDA kernel implementation. Adjust this value to restrict tactics that exceed
+ //! the specified threshold en masse. The default value is device max capability. This value must
+ //! be less than 1GiB.
+ //!
+ //! Updating this flag will override the shared memory limit set by \ref HardwareCompatibilityLevel,
+ //! which defaults to 48KiB.
+ //!
+ kTACTIC_SHARED_MEMORY = 5,
};
//!
@@ -8993,7 +8380,7 @@ enum class MemoryPoolType : int32_t
template <>
constexpr inline int32_t EnumMax() noexcept
{
- return 5;
+ return 6;
}
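// --- Editorial usage sketch (not part of the header): restricting memory pools. ---
// A hedged sketch; the 1 GiB workspace and 48 KiB shared-memory limits are illustrative values.
inline void limitMemoryPools(nvinfer1::IBuilderConfig& config)
{
    config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30);          // 1 GiB scratch
    config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kTACTIC_SHARED_MEMORY, 48 << 10); // 48 KiB
}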
//!
@@ -9006,40 +8393,12 @@ constexpr inline int32_t EnumMax() noexcept
//!
enum class PreviewFeature : int32_t
{
- //!
- //! Optimize runtime dimensions with TensorRT's DL Compiler.
- //! Potentially reduces run time and decreases device memory usage and engine size.
- //! Models most likely to benefit from enabling kFASTER_DYNAMIC_SHAPES_0805 are transformer-based models,
- //! and models containing dynamic control flows.
- //!
- //! The default value for this flag is on.
- //!
- //! \deprecated Turning it off is deprecated in TensorRT 8.6. The flag kFASTER_DYNAMIC_SHAPES_0805 will be removed in 9.0.
- //!
- kFASTER_DYNAMIC_SHAPES_0805 TRT_DEPRECATED_ENUM = 0,
-
- //!
- //! Disable usage of cuDNN/cuBLAS/cuBLASLt tactics in the TensorRT core library.
- //!
- //! When the flag is enabled, TensorRT core will not use these tactics even if they are specified in
- //! \ref IBuilderConfig::setTacticSources(), but cudnnContext and cublasContext handles will still be passed to
- //! plugins via IPluginV2Ext::attachToContext() if the appropriate tactic sources are set.
- //!
- //! This allows users to experiment with disabling external library tactics without having to modify their
- //! application's plugins to support nullptr handles.
- //!
- //! The default value for this flag is on.
- //!
- //! \see TacticSource
- //!
- kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 = 1,
-
//!
//! Allows optimization profiles to be shared across execution contexts.
- //! This flag defaults to false and will become the default behavior in TensorRT 9.0.
- //! At that point this flag will do nothing.
//!
- kPROFILE_SHARING_0806 = 2,
+    //! \deprecated Deprecated in TensorRT 10.0. The default value for this flag is on and cannot be changed.
+ //!
+ kPROFILE_SHARING_0806 TRT_DEPRECATED_ENUM = 0,
};
namespace impl
{
@@ -9051,13 +8410,20 @@ namespace impl
template <>
struct EnumMaxImpl
{
- static constexpr int32_t kVALUE = 3;
+ static constexpr int32_t kVALUE = 1;
};
} // namespace impl
-//! Describes requirements of compatibility with GPU architectures other than that of the GPU on which the engine was
-//! built. Levels except kNONE are only supported for engines built on NVIDIA Ampere and later GPUs.
-//! Note that compatibility with future hardware depends on CUDA forward compatibility support.
+//!
+//! \enum HardwareCompatibilityLevel
+//!
+//! \brief Describes requirements of compatibility with GPU architectures other than that of the GPU on which the engine was
+//! built.
+//!
+//! Levels except kNONE are only supported for engines built on NVIDIA Ampere and later GPUs.
+//!
+//! \warning Note that compatibility with future hardware depends on CUDA forward compatibility support.
+//!
enum class HardwareCompatibilityLevel : int32_t
{
//! Do not require hardware compatibility with GPU architectures other than that of the GPU on which the engine was
@@ -9085,48 +8451,105 @@ struct EnumMaxImpl
};
} // namespace impl
-//!
-//! \class IBuilderConfig
-//!
-//! \brief Holds properties for configuring a builder to produce an engine.
-//!
-//! \see BuilderFlags
-//!
-class IBuilderConfig : public INoCopy
+namespace v_1_0
+{
+class IProgressMonitor : public IVersionedInterface
{
public:
- virtual ~IBuilderConfig() noexcept = default;
+ IProgressMonitor() = default;
+ virtual ~IProgressMonitor() noexcept = default;
//!
- //! \brief Set the number of minimization iterations used when timing layers.
+ //! \brief Return version information associated with this interface. Applications must not override this method.
//!
- //! When timing layers, the builder minimizes over a set of average times for layer execution. This parameter
- //! controls the number of iterations used in minimization. The builder may sometimes run layers for more
- //! iterations to improve timing accuracy if this parameter is set to a small value and the runtime of the
- //! layer is short.
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IProgressMonitor", 1, 0};
+ }
+
//!
- //! \see getMinTimingIterations()
+ //! \brief Signal that a phase of the optimizer has started.
//!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by setAvgTimingIterations().
+ //! \param phaseName The name of this phase for tracking purposes.
+ //! \param parentPhase The parent phase that this phase belongs to, or nullptr if there is no parent.
+ //! \param nbSteps The number of steps that are involved in this phase.
//!
- TRT_DEPRECATED virtual void setMinTimingIterations(int32_t minTiming) noexcept
- {
- mImpl->setMinTimingIterations(minTiming);
- }
+ //! The phaseStart function signals to the application that the current phase is beginning, and that it has a
+    //! certain number of steps to perform. If \p parentPhase is nullptr, then the phaseStart is beginning an
+    //! independent phase, and if \p parentPhase is specified, then the current phase, specified by \p phaseName, is
+ //! within the scope of the parent phase. \p nbSteps will always be a positive number. The phaseStart function
+ //! implies that the first step is being executed. TensorRT will signal when each step is complete.
+ //!
+ //! Phase names are human readable English strings which are unique within a single phase hierarchy but which can be
+ //! reused once the previous instance has completed. Phase names and their hierarchies may change between versions
+ //! of TensorRT.
+ //!
+ //! \see phaseFinish
+ //!
+ virtual void phaseStart(char const* phaseName, char const* parentPhase, int32_t nbSteps) noexcept = 0;
//!
- //! \brief Query the number of minimization iterations.
+ //! \brief Signal that a step of an optimizer phase has finished.
//!
- //! By default the minimum number of iterations is 1.
+ //! \param phaseName The name of the innermost phase being executed.
+ //! \param step The step number that was completed.
//!
- //! \see setMinTimingIterations()
+ //! The stepComplete function signals to the application that TensorRT has finished the current \p step for the
+ //! phase \p phaseName, and will move on to the next step if there is one. The application can return false to have
+ //! TensorRT exit the build early. The step value increases on subsequent calls in the range [0, nbSteps).
//!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by getAvgTimingIterations().
+ //! \return true to continue to the next step or false to stop the build.
//!
- TRT_DEPRECATED virtual int32_t getMinTimingIterations() const noexcept
- {
- return mImpl->getMinTimingIterations();
- }
+ virtual bool stepComplete(char const* phaseName, int32_t step) noexcept = 0;
+
+ //!
+ //! \brief Signal that a phase of the optimizer has finished.
+ //!
+ //! \param phaseName The name of the phase that has finished.
+ //!
+ //! The phaseFinish function signals to the application that the phase is complete. This function may be called
+ //! before all steps in the range [0, nbSteps) have been reported to stepComplete. This scenario can be triggered by
+ //! error handling, internal optimizations, or when stepComplete returns false to request cancellation of the build.
+ //!
+ //! \see phaseStart
+ //!
+ virtual void phaseFinish(char const* phaseName) noexcept = 0;
+
+}; // class IProgressMonitor
+} // namespace v_1_0
+
+//!
+//! \class IProgressMonitor
+//!
+//! \brief Application-implemented progress reporting interface for TensorRT.
+//!
+//! The IProgressMonitor is a user-defined object that TensorRT uses to report back when an internal algorithm has
+//! started or finished a phase to help provide feedback on the progress of the optimizer.
+//!
+//! The IProgressMonitor will trigger its start function when a phase is entered and will trigger its finish function
+//! when that phase is exited. Each phase consists of one or more steps. When each step is completed, the stepComplete
+//! function is triggered. This allows an application using the builder to communicate progress relative to when
+//! the optimization is expected to complete.
+//!
+//! The implementation of IProgressMonitor must be thread-safe so that it can be called from multiple internal threads.
+//! The lifetime of the IProgressMonitor must exceed the lifetime of all TensorRT objects that use it.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IProgressMonitor, not
+//! v_1_0::IProgressMonitor.
+//!
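+//! A minimal sketch (hypothetical application code; the class and variable names are illustrative) of an
+//! IProgressMonitor that prints progress to stdout and never cancels the build:
+//!
+//! \code
+//! // Assumes <cstdio> is included.
+//! class ConsolePrinter : public nvinfer1::IProgressMonitor
+//! {
+//! public:
+//!     void phaseStart(char const* phaseName, char const* parentPhase, int32_t nbSteps) noexcept override
+//!     {
+//!         std::printf("start %s (%d steps)\n", phaseName, nbSteps);
+//!     }
+//!     bool stepComplete(char const* phaseName, int32_t step) noexcept override
+//!     {
+//!         std::printf("  %s: step %d done\n", phaseName, step);
+//!         return true; // return false here to cancel the build early
+//!     }
+//!     void phaseFinish(char const* phaseName) noexcept override
+//!     {
+//!         std::printf("finish %s\n", phaseName);
+//!     }
+//! };
+//!
+//! ConsolePrinter monitor;
+//! config->setProgressMonitor(&monitor); // config is an IBuilderConfig*; keep monitor alive for the whole build
+//! \endcode
+//!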
+using IProgressMonitor = v_1_0::IProgressMonitor;
+
+//!
+//! \class IBuilderConfig
+//!
+//! \brief Holds properties for configuring a builder to produce an engine.
+//!
+//! \see BuilderFlags
+//!
+class IBuilderConfig : public INoCopy
+{
+public:
+ virtual ~IBuilderConfig() noexcept = default;
//!
//! \brief Set the number of averaging iterations used when timing layers.
@@ -9196,38 +8619,6 @@ class IBuilderConfig : public INoCopy
return mImpl->getInt8Calibrator();
}
- //!
- //! \brief Set the maximum workspace size.
- //!
- //! \param workspaceSize The maximum GPU temporary memory which the engine can use at execution time.
- //!
- //! \see getMaxWorkspaceSize()
- //!
- //! \deprecated Deprecated in TensorRT 8.3. Superseded by IBuilderConfig::setMemoryPoolLimit() with
- //! MemoryPoolType::kWORKSPACE.
- //!
- TRT_DEPRECATED void setMaxWorkspaceSize(std::size_t workspaceSize) noexcept
- {
- mImpl->setMaxWorkspaceSize(workspaceSize);
- }
-
- //!
- //! \brief Get the maximum workspace size.
- //!
- //! By default the workspace size is the size of total global memory in the device.
- //!
- //! \return The maximum workspace size.
- //!
- //! \see setMaxWorkspaceSize()
- //!
- //! \deprecated Deprecated in TensorRT 8.3. Superseded by IBuilderConfig::getMemoryPoolLimit() with
- //! MemoryPoolType::kWORKSPACE.
- //!
- TRT_DEPRECATED std::size_t getMaxWorkspaceSize() const noexcept
- {
- return mImpl->getMaxWorkspaceSize();
- }
-
//!
//! \brief Set the build mode flags to turn on builder options for this network.
//!
@@ -9295,12 +8686,13 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Set the device that this layer must execute on.
+ //!
//! \param layer which layer to execute.
//! \param deviceType that this layer must execute on.
//! If DeviceType is not set or is reset, TensorRT will use the default DeviceType set in the builder.
//!
//! \note The device type for a layer must be compatible with the safety flow (if specified).
- //! For example a layer cannot be marked for DLA execution while the builder is configured for kSAFE_GPU.
+ //! For example, a layer cannot be marked for DLA execution while the builder is configured for kSAFETY.
//!
//! \see getDeviceType()
//!
@@ -9311,6 +8703,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Get the device that this layer executes on.
+ //!
//! \return Returns DeviceType of the layer.
//!
DeviceType getDeviceType(ILayer const* layer) const noexcept
@@ -9320,7 +8713,9 @@ class IBuilderConfig : public INoCopy
//!
//! \brief whether the DeviceType has been explicitly set for this layer
+ //!
//! \return true if device type is not default
+ //!
//! \see setDeviceType() getDeviceType() resetDeviceType()
//!
bool isDeviceTypeSet(ILayer const* layer) const noexcept
@@ -9340,6 +8735,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Checks if a layer can run on DLA.
+ //!
//! \return true if the layer can run on DLA, otherwise false.
//!
bool canRunOnDLA(ILayer const* layer) const noexcept
@@ -9349,6 +8745,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Sets the DLA core used by the network. Defaults to -1.
+ //!
//! \param dlaCore The DLA core to execute the engine on, in the range [0,getNbDlaCores()).
//!
//! This function is used to specify which DLA core to use via indexing, if multiple DLA cores are available.
@@ -9364,6 +8761,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Get the DLA core that the engine executes on.
+ //!
//! \return assigned DLA core or -1 for DLA not present or unset.
//!
int32_t getDLACore() const noexcept
@@ -9374,6 +8772,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Sets the default DeviceType to be used by the builder. It ensures that all the layers that can run on
//! this device will run on it, unless setDeviceType is used to override the default DeviceType for a layer.
+ //!
//! \see getDefaultDeviceType()
//!
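+ //! A typical DLA configuration sketch (illustrative; assumes a valid IBuilderConfig* named config):
+ //!
+ //! \code
+ //! config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
+ //! config->setDLACore(0);                                  // run on the first DLA core
+ //! config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);  // fall back to GPU for unsupported layers
+ //! \endcode
+ //!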
void setDefaultDeviceType(DeviceType deviceType) noexcept
@@ -9401,20 +8800,6 @@ class IBuilderConfig : public INoCopy
mImpl->reset();
}
- //!
- //! \brief Delete this IBuilderConfig.
- //!
- //! De-allocates any internally allocated memory.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Set the cuda stream that is used to profile this network.
//!
@@ -9447,6 +8832,7 @@ class IBuilderConfig : public INoCopy
//! a single optimization profile are not supported for refittable engines.
//!
//! \param profile The new optimization profile, which must satisfy profile->isValid() == true
+ //!
//! \return The index of the optimization profile (starting from 0) if the input is valid, or -1 if the input is
//! not valid.
//!
@@ -9518,6 +8904,7 @@ class IBuilderConfig : public INoCopy
//!
//! \param profile The new calibration profile, which must satisfy profile->isValid() == true or be nullptr.
//! MIN and MAX values will be overwritten by kOPT.
+ //!
//! \return True if the calibration profile was set correctly.
//!
bool setCalibrationProfile(IOptimizationProfile const* profile) noexcept
@@ -9783,6 +9170,19 @@ class IBuilderConfig : public INoCopy
//! which is currently 5. Setting it to greater than the maximum level results in behavior identical to the
//! maximum level.
//!
+ //! Below are descriptions of each builder optimization level:
+ //!
+ //! - Level 0: This enables the fastest compilation by disabling dynamic kernel generation and selecting the first
+ //! tactic that succeeds in execution. This will also not respect a timing cache.
+ //! - Level 1: Available tactics are sorted by heuristics, but only the top tactics are tested to select the best.
+ //! If a dynamic kernel is generated, its compile optimization is low.
+ //! - Level 2: Available tactics are sorted by heuristics, but only the fastest tactics are tested to select the
+ //! best.
+ //! - Level 3: Apply heuristics to see if a static precompiled kernel is applicable or if a new one has to be
+ //! compiled dynamically.
+ //! - Level 4: Always compiles a dynamic kernel.
+ //! - Level 5: Always compiles a dynamic kernel and compares it to static kernels.
+ //!
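+ //! For example (illustrative; assumes a valid IBuilderConfig* named config):
+ //!
+ //! \code
+ //! config->setBuilderOptimizationLevel(4); // spend more build time searching for faster kernels
+ //! \endcode
+ //!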
//! \param level The optimization level to set to. Must be non-negative.
//!
//! \see getBuilderOptimizationLevel
@@ -9804,6 +9204,7 @@ class IBuilderConfig : public INoCopy
return mImpl->getBuilderOptimizationLevel();
}
+ //!
//! \brief Set the hardware compatibility level.
//!
//! Hardware compatibility allows an engine to run on GPU
@@ -9908,38 +9309,65 @@ class IBuilderConfig : public INoCopy
return mImpl->getMaxAuxStreams();
}
+ //!
+ //! \brief Sets the progress monitor for building a network.
+ //!
+ //! \param monitor The progress monitor to assign to the IBuilderConfig.
+ //!
+ //! The progress monitor signals to the application when different phases of
+ //! the compiler are being executed. Setting to nullptr unsets the monitor so
+ //! that the application is not signaled.
+ //!
+ //! \see IBuilderConfig::getProgressMonitor
+ //!
+ void setProgressMonitor(IProgressMonitor* monitor) noexcept
+ {
+ return mImpl->setProgressMonitor(monitor);
+ }
+
+ //!
+ //! \return The progress monitor set by the application or nullptr.
+ //!
+ //! \see IBuilderConfig::setProgressMonitor
+ //!
+ IProgressMonitor* getProgressMonitor() const noexcept
+ {
+ return mImpl->getProgressMonitor();
+ }
+
protected:
apiv::VBuilderConfig* mImpl;
};
+//!
//! \brief Represents one or more NetworkDefinitionCreationFlag flags
//! using binary OR operations.
-//! e.g., 1U << NetworkDefinitionCreationFlag::kEXPLICIT_BATCH
+//! e.g., 1U << NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
//!
//! \see IBuilder::createNetworkV2
//!
using NetworkDefinitionCreationFlags = uint32_t;
+//!
//! \enum NetworkDefinitionCreationFlag
//!
//! \brief List of immutable network properties expressed at network creation time.
//! NetworkDefinitionCreationFlag is used with createNetworkV2() to specify immutable properties of the network.
-//! Creating a network without NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag has been deprecated.
//!
//! \see IBuilder::createNetworkV2
//!
enum class NetworkDefinitionCreationFlag : int32_t
{
- //! Mark the network to be an explicit batch network.
- //! Dynamic shape support requires that the kEXPLICIT_BATCH flag is set.
- //! With dynamic shapes, any of the input dimensions can vary at run-time,
- //! and there are no implicit dimensions in the network specification.
- //! Varying dimensions are specified by using the wildcard dimension value -1.
- kEXPLICIT_BATCH = 0,
-
- //! Deprecated. This flag has no effect now, but is only kept for backward compatability.
+ //! Ignored because networks are always "explicit batch" in TensorRT 10.0.
//!
- kEXPLICIT_PRECISION TRT_DEPRECATED_ENUM = 1,
+ //! \deprecated Deprecated in TensorRT 10.0.
+ kEXPLICIT_BATCH TRT_DEPRECATED_ENUM = 0,
+
+ //! Mark the network to be strongly typed.
+ //! Every tensor in the network has a data type defined in the network following only type inference rules and the
+ //! inputs/operator annotations. Setting layer precision and layer output types is not allowed, and the network
+ //! output types will be inferred based on the input types and the type inference rules.
+ kSTRONGLY_TYPED = 1,
};
//!
@@ -9965,36 +9393,6 @@ class IBuilder : public INoCopy
public:
virtual ~IBuilder() noexcept = default;
- //!
- //! \brief Set the maximum batch size. This has no effect for networks created with explicit batch dimension mode.
- //!
- //! \param batchSize The maximum batch size which can be used at execution time, and also the batch size for which
- //! the engine will be optimized.
- //!
- //! \deprecated Deprecated in TensorRT 8.4.
- //!
- //! \see getMaxBatchSize()
- //!
- TRT_DEPRECATED void setMaxBatchSize(int32_t batchSize) noexcept
- {
- mImpl->setMaxBatchSize(batchSize);
- }
-
- //!
- //! \brief Get the maximum batch size.
- //!
- //! \return The maximum batch size.
- //!
- //! \deprecated Deprecated in TensorRT 8.4.
- //!
- //! \see setMaxBatchSize()
- //! \see getMaxDLABatchSize()
- //!
- TRT_DEPRECATED int32_t getMaxBatchSize() const noexcept
- {
- return mImpl->getMaxBatchSize();
- }
-
//!
//! \brief Determine whether the platform has fast native fp16.
//!
@@ -10011,18 +9409,6 @@ class IBuilder : public INoCopy
return mImpl->platformHasFastInt8();
}
- //!
- //! \brief Destroy this object.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Get the maximum batch size DLA can support.
//! For any tensor, the total volume of index dimensions combined (dimensions other than CHW) with the requested
@@ -10045,6 +9431,7 @@ class IBuilder : public INoCopy
//!
//! \brief Set the GPU allocator.
+ //!
//! \param allocator Set the GPU allocator to be used by the builder. All GPU memory acquired will use this
//! allocator. If NULL is passed, the default allocator will be used.
//!
@@ -10070,30 +9457,19 @@ class IBuilder : public INoCopy
}
//!
- //! \brief Builds an engine for the given INetworkDefinition and given IBuilderConfig.
- //!
- //! It enables the builder to build multiple engines based on the same network definition, but with different
- //! builder configurations.
+ //! \brief Create a network definition object
//!
- //! \note This function will synchronize the cuda stream returned by \p config.getProfileStream() before returning.
+ //! Creates a network definition object with immutable properties specified using the flags parameter.
//!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by IBuilder::buildSerializedNetwork().
+ //! createNetworkV2 supports creating a network with properties from NetworkDefinitionCreationFlags.
//!
- TRT_DEPRECATED nvinfer1::ICudaEngine* buildEngineWithConfig(
- INetworkDefinition& network, IBuilderConfig& config) noexcept
- {
- return mImpl->buildEngineWithConfig(network, config);
- }
-
- //! \brief Create a network definition object
+ //! createNetworkV2 supports dynamic shapes and explicit batch dimensions by default.
//!
- //! Creates a network definition object with immutable properties specified using the flags parameter.
- //! CreateNetworkV2 supports dynamic shapes and explicit batch dimensions when used with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
- //! Creating a network without NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag has been deprecated.
+ //! createNetworkV2 with NetworkDefinitionCreationFlag::kSTRONGLY_TYPED flag supports creating a strongly typed plan
+ //! where tensor data types are inferred from network input types and operator type specification.
//!
//! \param flags Bitset of NetworkDefinitionCreationFlags specifying network properties combined with bitwise OR.
- //! e.g., 1U << NetworkDefinitionCreationFlag::kEXPLICIT_BATCH
+ //! e.g., 1U << NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
//!
//! \see INetworkDefinition, NetworkDefinitionCreationFlags
//!
@@ -10102,6 +9478,7 @@ class IBuilder : public INoCopy
return mImpl->createNetworkV2(flags);
}
+ //!
//! \brief Create a new optimization profile.
//!
//! If the network has any dynamic input tensors, the appropriate calls to setDimensions() must be made.
@@ -10127,7 +9504,7 @@ class IBuilder : public INoCopy
//! If an error recorder is not set, messages will be sent to the global log stream.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
void setErrorRecorder(IErrorRecorder* recorder) noexcept
@@ -10202,8 +9579,6 @@ class IBuilder : public INoCopy
//!
//! \note This function will synchronize the cuda stream returned by \p config.getProfileStream() before returning.
//!
- //! This function is only supported in NVIDIA Drive(R) products.
- //!
bool isNetworkSupported(INetworkDefinition const& network, IBuilderConfig const& config) const noexcept
{
return mImpl->isNetworkSupported(network, config);
@@ -10221,7 +9596,9 @@ class IBuilder : public INoCopy
//!
//! \brief Set the maximum number of threads.
+ //!
//! \param maxThreads The maximum number of threads that can be used by the builder.
+ //!
//! \return True if successful, false otherwise.
//!
//! The default value is 1 and includes the current thread.
diff --git a/include/NvInferConsistency.h b/include/NvInferConsistency.h
index a70249b6..5096c3f4 100644
--- a/include/NvInferConsistency.h
+++ b/include/NvInferConsistency.h
@@ -44,7 +44,7 @@ class IConsistencyChecker
public:
//!
//! \brief Check that a blob that was input to the createConsistencyChecker method represents a valid engine.
- //
+ //!
//! \return true if the original blob encoded an engine that belongs to valid engine domain with
//! target capability EngineCapability::kSAFETY, false otherwise.
//!
diff --git a/include/NvInferImpl.h b/include/NvInferImpl.h
index 38617246..1c2dbff8 100644
--- a/include/NvInferImpl.h
+++ b/include/NvInferImpl.h
@@ -26,11 +26,40 @@
namespace nvinfer1
{
+namespace v_1_0
+{
+class IProgressMonitor;
+}
+using IProgressMonitor = v_1_0::IProgressMonitor;
+
+namespace v_1_0
+{
+class IAlgorithmSelector;
+}
+using IAlgorithmSelector = v_1_0::IAlgorithmSelector;
+
+namespace v_1_0
+{
+class IProfiler;
+}
+using IProfiler = v_1_0::IProfiler;
+
+namespace v_1_0
+{
+class IOutputAllocator;
+}
+using IOutputAllocator = v_1_0::IOutputAllocator;
+
+namespace v_1_0
+{
+class IDebugListener;
+}
+using IDebugListener = v_1_0::IDebugListener;
+
class IActivationLayer;
class IAlgorithm;
class IAlgorithmContext;
class IAlgorithmIOInfo;
-class IAlgorithmSelector;
class IAlgorithmVariant;
class IAssertionLayer;
class IBuilder;
@@ -48,7 +77,6 @@ class IElementWiseLayer;
class IEngineInspector;
class IExecutionContext;
class IFillLayer;
-class IFullyConnectedLayer;
class IGatherLayer;
class IGridSampleLayer;
class IHostMemory;
@@ -70,7 +98,6 @@ class INMSLayer;
class INonZeroLayer;
class IOneHotLayer;
class IOptimizationProfile;
-class IOutputAllocator;
class IPaddingLayer;
class IParametricReLULayer;
class IPlugin;
@@ -79,19 +106,27 @@ class IPluginFactory;
class IPluginLayer;
class IPluginRegistry;
class IPluginV2Layer;
+
+namespace v_1_0
+{
+class IPluginV3;
+} // namespace v_1_0
+using IPluginV3 = v_1_0::IPluginV3;
+
+class IPluginV3Layer;
class IPoolingLayer;
-class IProfiler;
class IQuantizeLayer;
class IRaggedSoftMaxLayer;
class IRecurrenceLayer;
class IReduceLayer;
+class IRefitter;
class IResizeLayer;
class IReverseSequenceLayer;
-class IRNNv2Layer;
class IRuntime;
class IScaleLayer;
class IScatterLayer;
class ISelectLayer;
+class ISerializationConfig;
class IShapeLayer;
class IShuffleLayer;
class ISliceLayer;
@@ -130,13 +165,10 @@ enum class ResizeCoordinateTransformation : int32_t;
enum class InterpolationMode : int32_t;
enum class ResizeRoundMode : int32_t;
enum class ResizeSelector : int32_t;
-enum class RNNDirection : int32_t;
-enum class RNNGateType : int32_t;
-enum class RNNInputMode : int32_t;
-enum class RNNOperation : int32_t;
enum class ScaleMode : int32_t;
enum class ScatterMode : int32_t;
enum class SampleMode : int32_t;
+enum class SerializationFlag : int32_t;
enum class TensorIOMode : int32_t;
enum class TensorLocation : int32_t;
enum class TopKOperation : int32_t;
@@ -145,6 +177,7 @@ enum class UnaryOperation : int32_t;
enum class WeightsRole : int32_t;
enum class PreviewFeature : int32_t;
enum class HardwareCompatibilityLevel : int32_t;
+enum class ExecutionContextAllocationStrategy : int32_t;
using TacticSources = uint32_t;
using TensorFormats = uint32_t;
@@ -152,8 +185,7 @@ using BuilderFlags = uint32_t;
using NetworkDefinitionCreationFlags = uint32_t;
using QuantizationFlags = uint32_t;
using TempfileControlFlags = uint32_t;
-using ResizeMode = InterpolationMode;
-using SliceMode = SampleMode;
+using SerializationFlags = uint32_t;
//!
//! \file NvInferImpl.h
@@ -184,23 +216,28 @@ class VDimensionExpr : public VRoot
{
public:
virtual bool isConstant() const = 0;
- virtual int32_t getConstantValue() const = 0;
+ virtual int64_t getConstantValue() const = 0;
+ virtual bool isSizeTensor() const = 0;
};
class VExprBuilder : public VRoot
{
public:
- virtual IDimensionExpr const* constant(int32_t value) = 0;
+ virtual IDimensionExpr const* constant(int64_t value) = 0;
virtual IDimensionExpr const* operation(
DimensionOperation op, IDimensionExpr const& first, IDimensionExpr const& second)
= 0;
+ virtual IDimensionExpr const* declareSizeTensor(
+ int32_t outputIndex, IDimensionExpr const& opt, IDimensionExpr const& upper)
+ = 0;
};
class VRuntime : public VRoot
{
public:
- virtual nvinfer1::ICudaEngine* deserializeCudaEngine(
- void const* blob, std::size_t size, IPluginFactory* pluginFactory) noexcept = 0;
+ virtual IRuntime* getPImpl() noexcept = 0;
+ virtual nvinfer1::ICudaEngine* deserializeCudaEngine(void const* blob, std::size_t size) noexcept = 0;
+ virtual nvinfer1::ICudaEngine* deserializeCudaEngine(IStreamReader& streamReader) noexcept = 0;
virtual void setDLACore(int32_t dlaCore) noexcept = 0;
virtual int32_t getDLACore() const noexcept = 0;
virtual int32_t getNbDLACores() const noexcept = 0;
@@ -214,7 +251,6 @@ class VRuntime : public VRoot
virtual char const* getTemporaryDirectory() const noexcept = 0;
virtual void setTempfileControlFlags(TempfileControlFlags) noexcept = 0;
virtual TempfileControlFlags getTempfileControlFlags() const noexcept = 0;
- virtual IRuntime* getPImpl() noexcept = 0;
virtual IPluginRegistry& getPluginRegistry() noexcept = 0;
virtual void setPluginRegistryParent(IPluginRegistry* parent) noexcept = 0;
virtual IRuntime* loadRuntime(char const* path) noexcept = 0;
@@ -225,6 +261,7 @@ class VRuntime : public VRoot
class VRefitter : public VRoot
{
public:
+ virtual IRefitter* getPImpl() noexcept = 0;
virtual bool setWeights(char const* layerName, WeightsRole role, const Weights weights) noexcept = 0;
virtual bool refitCudaEngine() noexcept = 0;
virtual int32_t getMissing(int32_t size, char const** layerNames, WeightsRole* roles) noexcept = 0;
@@ -241,12 +278,20 @@ class VRefitter : public VRoot
virtual ILogger* getLogger() const noexcept = 0;
virtual bool setMaxThreads(int32_t maxThreads) noexcept = 0;
virtual int32_t getMaxThreads() const noexcept = 0;
+ virtual bool setNamedWeightsWithLocation(char const* name, Weights weights, TensorLocation location) noexcept = 0;
+ virtual Weights getNamedWeights(char const* weightsName) const noexcept = 0;
+ virtual TensorLocation getWeightsLocation(char const* weightsName) const noexcept = 0;
+ virtual bool unsetNamedWeights(char const* weightsName) noexcept = 0;
+ virtual void setWeightsValidation(bool weightsValidation) noexcept = 0;
+ virtual bool getWeightsValidation() const noexcept = 0;
+ virtual bool refitCudaEngineAsync(cudaStream_t stream) noexcept = 0;
+ virtual Weights getWeightsPrototype(char const* weightsName) const noexcept = 0;
};
class VOptimizationProfile : public VRoot
{
public:
- virtual bool setDimensions(char const* inputName, OptProfileSelector select, Dims dims) noexcept = 0;
+ virtual bool setDimensions(char const* inputName, OptProfileSelector select, Dims const& dims) noexcept = 0;
virtual Dims getDimensions(char const* inputName, OptProfileSelector select) const noexcept = 0;
virtual bool setShapeValues(
char const* inputName, OptProfileSelector select, int32_t const* values, int32_t nbValues) noexcept = 0;
@@ -260,33 +305,17 @@ class VOptimizationProfile : public VRoot
class VCudaEngine : public VRoot
{
public:
- virtual int32_t getNbBindings() const noexcept = 0;
- virtual int32_t getBindingIndex(char const* name) const noexcept = 0;
- virtual char const* getBindingName(int32_t bindingIndex) const noexcept = 0;
- virtual bool bindingIsInput(int32_t bindingIndex) const noexcept = 0;
- virtual Dims getBindingDimensions(int32_t bindingIndex) const noexcept = 0;
- virtual DataType getBindingDataType(int32_t bindingIndex) const noexcept = 0;
- virtual int32_t getMaxBatchSize() const noexcept = 0;
+ virtual ICudaEngine* getPImpl() noexcept = 0;
virtual int32_t getNbLayers() const noexcept = 0;
virtual IHostMemory* serialize() const noexcept = 0;
- virtual IExecutionContext* createExecutionContext() noexcept = 0;
- virtual TensorLocation getLocation(int32_t bindingIndex) const noexcept = 0;
+ virtual IExecutionContext* createExecutionContext(ExecutionContextAllocationStrategy strategy) noexcept = 0;
virtual IExecutionContext* createExecutionContextWithoutDeviceMemory() noexcept = 0;
virtual size_t getDeviceMemorySize() const noexcept = 0;
virtual bool isRefittable() const noexcept = 0;
- virtual int32_t getBindingBytesPerComponent(int32_t bindingIndex) const noexcept = 0;
- virtual int32_t getBindingComponentsPerElement(int32_t bindingIndex) const noexcept = 0;
- virtual TensorFormat getBindingFormat(int32_t bindingIndex) const noexcept = 0;
- virtual char const* getBindingFormatDesc(int32_t bindingIndex) const noexcept = 0;
- virtual int32_t getBindingVectorizedDim(int32_t bindingIndex) const noexcept = 0;
virtual char const* getName() const noexcept = 0;
virtual int32_t getNbOptimizationProfiles() const noexcept = 0;
- virtual Dims getProfileDimensions(
- int32_t bindingIndex, int32_t profileIndex, OptProfileSelector select) const noexcept = 0;
- virtual int32_t const* getProfileShapeValues(
- int32_t profileIndex, int32_t inputIndex, OptProfileSelector select) const noexcept = 0;
- virtual bool isShapeBinding(int32_t bindingIndex) const noexcept = 0;
- virtual bool isExecutionBinding(int32_t bindingIndex) const noexcept = 0;
+ virtual int32_t const* getProfileTensorValues(
+ char const* tensorName, int32_t profileIndex, OptProfileSelector select) const noexcept = 0;
virtual EngineCapability getEngineCapability() const noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
@@ -309,7 +338,6 @@ class VCudaEngine : public VRoot
virtual int32_t getNbIOTensors() const noexcept = 0;
virtual char const* getIOTensorName(int32_t index) const noexcept = 0;
virtual HardwareCompatibilityLevel getHardwareCompatibilityLevel() const noexcept = 0;
- virtual ICudaEngine* getPImpl() noexcept = 0;
virtual int32_t getNbAuxStreams() const noexcept = 0;
virtual int32_t getTensorBytesPerComponentV2(char const* tensorName, int32_t profileIndex) const noexcept = 0;
@@ -317,15 +345,25 @@ class VCudaEngine : public VRoot
virtual TensorFormat getTensorFormatV2(char const* tensorName, int32_t profileIndex) const noexcept = 0;
virtual char const* getTensorFormatDescV2(char const* tensorName, int32_t profileIndex) const noexcept = 0;
virtual int32_t getTensorVectorizedDimV2(char const* tensorName, int32_t profileIndex) const noexcept = 0;
+
+ virtual ISerializationConfig* createSerializationConfig() noexcept = 0;
+ virtual IHostMemory* serializeWithConfig(ISerializationConfig& config) const noexcept = 0;
+
+ virtual size_t getDeviceMemorySizeForProfile(int32_t profileIndex) const noexcept = 0;
+ virtual IRefitter* createRefitter(ILogger& logger) noexcept = 0;
+
+ virtual bool setWeightStreamingBudget(int64_t gpuMemoryBudget) noexcept = 0;
+ virtual int64_t getWeightStreamingBudget() const noexcept = 0;
+ virtual int64_t getMinimumWeightStreamingBudget() const noexcept = 0;
+ virtual int64_t getStreamableWeightsSize() const noexcept = 0;
+
+ virtual bool isDebugTensor(char const* name) const noexcept = 0;
};
class VExecutionContext : public VRoot
{
public:
- virtual bool execute(int32_t batchSize, void* const* bindings) noexcept = 0;
- virtual bool enqueue(
- int32_t batchSize, void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept
- = 0;
+ virtual IExecutionContext* getPImpl() noexcept = 0;
virtual void setDebugSync(bool sync) noexcept = 0;
virtual bool getDebugSync() const noexcept = 0;
virtual void setProfiler(IProfiler*) noexcept = 0;
@@ -334,19 +372,12 @@ class VExecutionContext : public VRoot
virtual void setName(char const* name) noexcept = 0;
virtual char const* getName() const noexcept = 0;
virtual void setDeviceMemory(void* memory) noexcept = 0;
- virtual Dims getStrides(int32_t bindingIndex) const noexcept = 0;
- virtual bool setOptimizationProfile(int32_t profileIndex) noexcept = 0;
virtual int32_t getOptimizationProfile() const noexcept = 0;
- virtual bool setBindingDimensions(int32_t bindingIndex, Dims dimensions) noexcept = 0;
- virtual Dims getBindingDimensions(int32_t bindingIndex) const noexcept = 0;
- virtual bool setInputShapeBinding(int32_t bindingIndex, int32_t const* data) noexcept = 0;
- virtual bool getShapeBinding(int32_t bindingIndex, int32_t* data) const noexcept = 0;
virtual bool allInputDimensionsSpecified() const noexcept = 0;
virtual bool allInputShapesSpecified() const noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
virtual bool executeV2(void* const* bindings) noexcept = 0;
- virtual bool enqueueV2(void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept = 0;
virtual bool setOptimizationProfileAsync(int32_t profileIndex, cudaStream_t stream) noexcept = 0;
virtual void setEnqueueEmitsProfile(bool enqueueEmitsProfile) noexcept = 0;
virtual bool getEnqueueEmitsProfile() const noexcept = 0;
@@ -357,6 +388,7 @@ class VExecutionContext : public VRoot
virtual bool setTensorAddress(char const* tensorName, void* data) noexcept = 0;
virtual void const* getTensorAddress(char const* tensorName) const noexcept = 0;
virtual bool setInputTensorAddress(char const* tensorName, void const* data) noexcept = 0;
+ virtual bool setOutputTensorAddress(char const* tensorName, void* data) noexcept = 0;
virtual int32_t inferShapes(int32_t nbMaxNames, char const** tensorNames) noexcept = 0;
virtual bool setInputConsumedEvent(cudaEvent_t event) noexcept = 0;
virtual cudaEvent_t getInputConsumedEvent() const noexcept = 0;
@@ -371,20 +403,25 @@ class VExecutionContext : public VRoot
virtual size_t getPersistentCacheLimit() const noexcept = 0;
virtual bool setNvtxVerbosity(ProfilingVerbosity verbosity) noexcept = 0;
virtual ProfilingVerbosity getNvtxVerbosity() const noexcept = 0;
- virtual IExecutionContext* getPImpl() noexcept = 0;
virtual void setAuxStreams(cudaStream_t* auxStreams, int32_t nbStreams) noexcept = 0;
+ virtual bool setDebugListener(IDebugListener* listener) noexcept = 0;
+ virtual IDebugListener* getDebugListener() noexcept = 0;
+ virtual bool setTensorDebugState(char const* name, bool flag) noexcept = 0;
+ virtual bool getDebugState(char const* name) const noexcept = 0;
+ virtual bool setAllTensorsDebugState(bool flag) noexcept = 0;
+ virtual size_t updateDeviceMemorySizeForShapes() noexcept = 0;
};
class VEngineInspector : public VRoot
{
public:
+ virtual IEngineInspector* getPImpl() noexcept = 0;
virtual bool setExecutionContext(IExecutionContext const* context) noexcept = 0;
virtual IExecutionContext const* getExecutionContext() const noexcept = 0;
virtual char const* getLayerInformation(int32_t layerIndex, LayerInformationFormat format) const noexcept = 0;
virtual char const* getEngineInformation(LayerInformationFormat format) const noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
- virtual IEngineInspector* getPImpl() noexcept = 0;
};
class VTensor : public VRoot
@@ -392,7 +429,7 @@ class VTensor : public VRoot
public:
virtual void setName(char const* name) noexcept = 0;
virtual char const* getName() const noexcept = 0;
- virtual void setDimensions(Dims dimensions) noexcept = 0;
+ virtual void setDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getDimensions() const noexcept = 0;
virtual void setType(DataType type) noexcept = 0;
virtual DataType getType() const noexcept = 0;
@@ -440,49 +477,30 @@ class VLayer : public VRoot
class VConvolutionLayer : public VRoot
{
public:
- virtual void setKernelSize(DimsHW kernelSize) noexcept = 0;
- virtual DimsHW getKernelSize() const noexcept = 0;
- virtual void setNbOutputMaps(int32_t nbOutputMaps) noexcept = 0;
- virtual int32_t getNbOutputMaps() const noexcept = 0;
- virtual void setStride(DimsHW stride) noexcept = 0;
- virtual DimsHW getStride() const noexcept = 0;
- virtual void setPadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPadding() const noexcept = 0;
- virtual void setNbGroups(int32_t nbGroups) noexcept = 0;
- virtual int32_t getNbGroups() const noexcept = 0;
+ virtual void setNbOutputMaps(int64_t nbOutputMaps) noexcept = 0;
+ virtual int64_t getNbOutputMaps() const noexcept = 0;
+ virtual void setNbGroups(int64_t nbGroups) noexcept = 0;
+ virtual int64_t getNbGroups() const noexcept = 0;
virtual void setKernelWeights(Weights weights) noexcept = 0;
virtual Weights getKernelWeights() const noexcept = 0;
virtual void setBiasWeights(Weights weights) noexcept = 0;
virtual Weights getBiasWeights() const noexcept = 0;
- virtual void setDilation(DimsHW dilation) noexcept = 0;
- virtual DimsHW getDilation() const noexcept = 0;
- virtual void setPrePadding(Dims padding) noexcept = 0;
+ virtual void setPrePadding(Dims const& padding) noexcept = 0;
virtual Dims getPrePadding() const noexcept = 0;
- virtual void setPostPadding(Dims padding) noexcept = 0;
+ virtual void setPostPadding(Dims const& padding) noexcept = 0;
virtual Dims getPostPadding() const noexcept = 0;
virtual void setPaddingMode(PaddingMode paddingMode) noexcept = 0;
virtual PaddingMode getPaddingMode() const noexcept = 0;
- virtual void setKernelSizeNd(Dims kernelSize) noexcept = 0;
+ virtual void setKernelSizeNd(Dims const& kernelSize) noexcept = 0;
virtual Dims getKernelSizeNd() const noexcept = 0;
- virtual void setStrideNd(Dims stride) noexcept = 0;
+ virtual void setStrideNd(Dims const& stride) noexcept = 0;
virtual Dims getStrideNd() const noexcept = 0;
- virtual void setPaddingNd(Dims padding) noexcept = 0;
+ virtual void setPaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPaddingNd() const noexcept = 0;
- virtual void setDilationNd(Dims dilation) noexcept = 0;
+ virtual void setDilationNd(Dims const& dilation) noexcept = 0;
virtual Dims getDilationNd() const noexcept = 0;
};
-class VFullyConnectedLayer : public VRoot
-{
-public:
- virtual void setNbOutputChannels(int32_t nbOutputs) noexcept = 0;
- virtual int32_t getNbOutputChannels() const noexcept = 0;
- virtual void setKernelWeights(Weights weights) noexcept = 0;
- virtual Weights getKernelWeights() const noexcept = 0;
- virtual void setBiasWeights(Weights weights) noexcept = 0;
- virtual Weights getBiasWeights() const noexcept = 0;
-};
-
class VActivationLayer : public VRoot
{
public:
@@ -499,35 +517,29 @@ class VPoolingLayer : public VRoot
public:
virtual void setPoolingType(PoolingType type) noexcept = 0;
virtual PoolingType getPoolingType() const noexcept = 0;
- virtual void setWindowSize(DimsHW windowSize) noexcept = 0;
- virtual DimsHW getWindowSize() const noexcept = 0;
- virtual void setStride(DimsHW stride) noexcept = 0;
- virtual DimsHW getStride() const noexcept = 0;
- virtual void setPadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPadding() const noexcept = 0;
virtual void setBlendFactor(float blendFactor) noexcept = 0;
virtual float getBlendFactor() const noexcept = 0;
virtual void setAverageCountExcludesPadding(bool exclusive) noexcept = 0;
virtual bool getAverageCountExcludesPadding() const noexcept = 0;
- virtual void setPrePadding(Dims padding) noexcept = 0;
+ virtual void setPrePadding(Dims const& padding) noexcept = 0;
virtual Dims getPrePadding() const noexcept = 0;
- virtual void setPostPadding(Dims padding) noexcept = 0;
+ virtual void setPostPadding(Dims const& padding) noexcept = 0;
virtual Dims getPostPadding() const noexcept = 0;
virtual void setPaddingMode(PaddingMode paddingMode) noexcept = 0;
virtual PaddingMode getPaddingMode() const noexcept = 0;
- virtual void setWindowSizeNd(Dims windowSize) noexcept = 0;
+ virtual void setWindowSizeNd(Dims const& windowSize) noexcept = 0;
virtual Dims getWindowSizeNd() const noexcept = 0;
- virtual void setStrideNd(Dims stride) noexcept = 0;
+ virtual void setStrideNd(Dims const& stride) noexcept = 0;
virtual Dims getStrideNd() const noexcept = 0;
- virtual void setPaddingNd(Dims padding) noexcept = 0;
+ virtual void setPaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPaddingNd() const noexcept = 0;
};
class VLRNLayer : public VRoot
{
public:
- virtual void setWindowSize(int32_t windowSize) noexcept = 0;
- virtual int32_t getWindowSize() const noexcept = 0;
+ virtual void setWindowSize(int64_t windowSize) noexcept = 0;
+ virtual int64_t getWindowSize() const noexcept = 0;
virtual void setAlpha(float alpha) noexcept = 0;
virtual float getAlpha() const noexcept = 0;
virtual void setBeta(float beta) noexcept = 0;
@@ -568,33 +580,27 @@ class VConcatenationLayer : public VRoot
class VDeconvolutionLayer : public VRoot
{
public:
- virtual void setKernelSize(DimsHW kernelSize) noexcept = 0;
- virtual DimsHW getKernelSize() const noexcept = 0;
- virtual void setNbOutputMaps(int32_t nbOutputMaps) noexcept = 0;
- virtual int32_t getNbOutputMaps() const noexcept = 0;
- virtual void setStride(DimsHW stride) noexcept = 0;
- virtual DimsHW getStride() const noexcept = 0;
- virtual void setPadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPadding() const noexcept = 0;
- virtual void setNbGroups(int32_t nbGroups) noexcept = 0;
- virtual int32_t getNbGroups() const noexcept = 0;
+ virtual void setNbOutputMaps(int64_t nbOutputMaps) noexcept = 0;
+ virtual int64_t getNbOutputMaps() const noexcept = 0;
+ virtual void setNbGroups(int64_t nbGroups) noexcept = 0;
+ virtual int64_t getNbGroups() const noexcept = 0;
virtual void setKernelWeights(Weights weights) noexcept = 0;
virtual Weights getKernelWeights() const noexcept = 0;
virtual void setBiasWeights(Weights weights) noexcept = 0;
virtual Weights getBiasWeights() const noexcept = 0;
- virtual void setPrePadding(Dims padding) noexcept = 0;
+ virtual void setPrePadding(Dims const& padding) noexcept = 0;
virtual Dims getPrePadding() const noexcept = 0;
- virtual void setPostPadding(Dims padding) noexcept = 0;
+ virtual void setPostPadding(Dims const& padding) noexcept = 0;
virtual Dims getPostPadding() const noexcept = 0;
virtual void setPaddingMode(PaddingMode paddingMode) noexcept = 0;
virtual PaddingMode getPaddingMode() const noexcept = 0;
- virtual void setKernelSizeNd(Dims kernelSize) noexcept = 0;
+ virtual void setKernelSizeNd(Dims const& kernelSize) noexcept = 0;
virtual Dims getKernelSizeNd() const noexcept = 0;
- virtual void setStrideNd(Dims stride) noexcept = 0;
+ virtual void setStrideNd(Dims const& stride) noexcept = 0;
virtual Dims getStrideNd() const noexcept = 0;
- virtual void setPaddingNd(Dims padding) noexcept = 0;
+ virtual void setPaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPaddingNd() const noexcept = 0;
- virtual void setDilationNd(Dims dilation) noexcept = 0;
+ virtual void setDilationNd(Dims const& dilation) noexcept = 0;
virtual Dims getDilationNd() const noexcept = 0;
};
@@ -616,31 +622,6 @@ class VGatherLayer : public VRoot
virtual GatherMode getMode() const noexcept = 0;
};
-class VRNNv2Layer : public VRoot
-{
-public:
- virtual int32_t getLayerCount() const noexcept = 0;
- virtual int32_t getHiddenSize() const noexcept = 0;
- virtual int32_t getMaxSeqLength() const noexcept = 0;
- virtual int32_t getDataLength() const noexcept = 0;
- virtual void setSequenceLengths(ITensor& seqLengths) noexcept = 0;
- virtual ITensor* getSequenceLengths() const noexcept = 0;
- virtual void setOperation(RNNOperation op) noexcept = 0;
- virtual RNNOperation getOperation() const noexcept = 0;
- virtual void setInputMode(RNNInputMode op) noexcept = 0;
- virtual RNNInputMode getInputMode() const noexcept = 0;
- virtual void setDirection(RNNDirection op) noexcept = 0;
- virtual RNNDirection getDirection() const noexcept = 0;
- virtual void setWeightsForGate(int32_t layerIndex, RNNGateType gate, bool isW, Weights weights) noexcept = 0;
- virtual Weights getWeightsForGate(int32_t layerIndex, RNNGateType gate, bool isW) const noexcept = 0;
- virtual void setBiasForGate(int32_t layerIndex, RNNGateType gate, bool isW, Weights bias) noexcept = 0;
- virtual Weights getBiasForGate(int32_t layerIndex, RNNGateType gate, bool isW) const noexcept = 0;
- virtual void setHiddenState(ITensor& hidden) noexcept = 0;
- virtual ITensor* getHiddenState() const noexcept = 0;
- virtual void setCellState(ITensor& cell) noexcept = 0;
- virtual ITensor* getCellState() const noexcept = 0;
-};
-
class VPluginLayer : public VRoot
{
public:
@@ -653,6 +634,12 @@ class VPluginV2Layer : public VRoot
virtual IPluginV2& getPlugin() noexcept = 0;
};
+class VPluginV3Layer : public VRoot
+{
+public:
+ virtual IPluginV3& getPlugin() noexcept = 0;
+};
+
class VUnaryLayer : public VRoot
{
public:
@@ -674,13 +661,9 @@ class VReduceLayer : public VRoot
class VPaddingLayer : public VRoot
{
public:
- virtual void setPrePadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPrePadding() const noexcept = 0;
- virtual void setPostPadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPostPadding() const noexcept = 0;
- virtual void setPrePaddingNd(Dims padding) noexcept = 0;
+ virtual void setPrePaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPrePaddingNd() const noexcept = 0;
- virtual void setPostPaddingNd(Dims padding) noexcept = 0;
+ virtual void setPostPaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPostPaddingNd() const noexcept = 0;
};
@@ -689,7 +672,7 @@ class VShuffleLayer : public VRoot
public:
virtual void setFirstTranspose(Permutation const& permutation) noexcept = 0;
virtual Permutation const& getFirstTranspose() const noexcept = 0;
- virtual void setReshapeDimensions(Dims dimensions) noexcept = 0;
+ virtual void setReshapeDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getReshapeDimensions() const noexcept = 0;
virtual void setSecondTranspose(Permutation const& permutation) noexcept = 0;
virtual Permutation const& getSecondTranspose() const noexcept = 0;
@@ -700,14 +683,14 @@ class VShuffleLayer : public VRoot
class VSliceLayer : public VRoot
{
public:
- virtual void setStart(Dims start) noexcept = 0;
+ virtual void setStart(Dims const& start) noexcept = 0;
virtual Dims getStart() const noexcept = 0;
- virtual void setSize(Dims size) noexcept = 0;
+ virtual void setSize(Dims const& size) noexcept = 0;
virtual Dims getSize() const noexcept = 0;
- virtual void setStride(Dims stride) noexcept = 0;
+ virtual void setStride(Dims const& stride) noexcept = 0;
virtual Dims getStride() const noexcept = 0;
- virtual void setMode(SliceMode mode) noexcept = 0;
- virtual SliceMode getMode() const noexcept = 0;
+ virtual void setMode(SampleMode mode) noexcept = 0;
+ virtual SampleMode getMode() const noexcept = 0;
};
class VShapeLayer : public VRoot
@@ -760,7 +743,7 @@ class VConstantLayer : public VRoot
public:
virtual void setWeights(Weights weights) noexcept = 0;
virtual Weights getWeights() const noexcept = 0;
- virtual void setDimensions(Dims dimensions) noexcept = 0;
+ virtual void setDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getDimensions() const noexcept = 0;
};
@@ -772,14 +755,12 @@ class VParametricReLULayer : public VRoot
class VResizeLayer : public VRoot
{
public:
- virtual void setOutputDimensions(Dims dimensions) noexcept = 0;
+ virtual void setOutputDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getOutputDimensions() const noexcept = 0;
virtual void setScales(float const* scales, int32_t nbScales) noexcept = 0;
virtual int32_t getScales(int32_t size, float* scales) const noexcept = 0;
- virtual void setResizeMode(ResizeMode resizeMode) noexcept = 0;
- virtual ResizeMode getResizeMode() const noexcept = 0;
- virtual void setAlignCorners(bool alignCorners) noexcept = 0;
- virtual bool getAlignCorners() const noexcept = 0;
+ virtual void setResizeMode(InterpolationMode interpolationMode) noexcept = 0;
+ virtual InterpolationMode getResizeMode() const noexcept = 0;
virtual void setCoordinateTransformation(ResizeCoordinateTransformation coordTransform) noexcept = 0;
virtual ResizeCoordinateTransformation getCoordinateTransformation() const noexcept = 0;
virtual void setSelectorForSinglePixel(ResizeSelector selector) noexcept = 0;
@@ -881,7 +862,7 @@ class VAssertionLayer : public VRoot
class VFillLayer : public VRoot
{
public:
- virtual void setDimensions(Dims dimensions) noexcept = 0;
+ virtual void setDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getDimensions() const noexcept = 0;
virtual void setOperation(FillOperation op) noexcept = 0;
virtual FillOperation getOperation() const noexcept = 0;
@@ -889,6 +870,13 @@ class VFillLayer : public VRoot
virtual double getAlpha() const noexcept = 0;
virtual void setBeta(double beta) noexcept = 0;
virtual double getBeta() const noexcept = 0;
+ virtual void setAlphaInt64(int64_t alpha) noexcept = 0;
+ virtual int64_t getAlphaInt64() const noexcept = 0;
+ virtual void setBetaInt64(int64_t beta) noexcept = 0;
+ virtual int64_t getBetaInt64() const noexcept = 0;
+ virtual bool isAlphaBetaInt64() const noexcept = 0;
+ virtual DataType getToType() const noexcept = 0;
+ virtual void setToType(DataType toType) noexcept = 0;
};
class VQuantizeLayer : public VRoot
@@ -896,6 +884,8 @@ class VQuantizeLayer : public VRoot
public:
virtual int32_t getAxis() const noexcept = 0;
virtual void setAxis(int32_t axis) noexcept = 0;
+ virtual DataType getToType() const noexcept = 0;
+ virtual void setToType(DataType toType) noexcept = 0;
};
class VDequantizeLayer : public VRoot
@@ -903,6 +893,8 @@ class VDequantizeLayer : public VRoot
public:
virtual int32_t getAxis() const noexcept = 0;
virtual void setAxis(int32_t axis) noexcept = 0;
+ virtual DataType getToType() const noexcept = 0;
+ virtual void setToType(DataType toType) noexcept = 0;
};
class VScatterLayer : public VRoot
@@ -965,8 +957,8 @@ class VNormalizationLayer : public VRoot
virtual float getEpsilon() const noexcept = 0;
virtual void setAxes(uint32_t axesMask) noexcept = 0;
virtual uint32_t getAxes() const noexcept = 0;
- virtual void setNbGroups(int32_t nbGroups) noexcept = 0;
- virtual int32_t getNbGroups() const noexcept = 0;
+ virtual void setNbGroups(int64_t nbGroups) noexcept = 0;
+ virtual int64_t getNbGroups() const noexcept = 0;
virtual void setComputePrecision(DataType type) noexcept = 0;
virtual DataType getComputePrecision() const noexcept = 0;
}; // class VNormalizationLayer
@@ -974,26 +966,16 @@ class VNormalizationLayer : public VRoot
class VNetworkDefinition : public VRoot
{
public:
- virtual ITensor* addInput(char const* name, DataType type, Dims dimensions) noexcept = 0;
+ virtual ITensor* addInput(char const* name, DataType type, Dims const& dimensions) noexcept = 0;
virtual void markOutput(ITensor& tensor) noexcept = 0;
- virtual IConvolutionLayer* addConvolution(ITensor& input, int32_t nbOutputMaps, DimsHW kernelSize,
- Weights kernelWeights, Weights biasWeights) noexcept = 0;
- virtual IFullyConnectedLayer* addFullyConnected(
- ITensor& input, int32_t nbOutputs, Weights kernelWeights, Weights biasWeights) noexcept
- = 0;
virtual IActivationLayer* addActivation(ITensor& input, ActivationType type) noexcept = 0;
- virtual IPoolingLayer* addPooling(ITensor& input, PoolingType type, DimsHW windowSize) noexcept = 0;
- virtual ILRNLayer* addLRN(ITensor& input, int32_t window, float alpha, float beta, float k) noexcept = 0;
- virtual IScaleLayer* addScale(ITensor& input, ScaleMode mode, Weights shift, Weights scale, Weights power) noexcept
- = 0;
+ virtual ILRNLayer* addLRN(ITensor& input, int64_t window, float alpha, float beta, float k) noexcept = 0;
+ virtual IScaleLayer* addScale(
+ ITensor& input, ScaleMode mode, Weights shift, Weights scale, Weights power) noexcept = 0;
virtual ISoftMaxLayer* addSoftMax(ITensor& input) noexcept = 0;
virtual IConcatenationLayer* addConcatenation(ITensor* const* inputs, int32_t nbInputs) noexcept = 0;
- virtual IDeconvolutionLayer* addDeconvolution(
- ITensor& input, int32_t nbOutputMaps, DimsHW kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
- = 0;
virtual IElementWiseLayer* addElementWise(ITensor& input1, ITensor& input2, ElementWiseOperation op) noexcept = 0;
virtual IUnaryLayer* addUnary(ITensor& input, UnaryOperation operation) noexcept = 0;
- virtual IPaddingLayer* addPadding(ITensor& input, DimsHW prePadding, DimsHW postPadding) noexcept = 0;
virtual IShuffleLayer* addShuffle(ITensor& input) noexcept = 0;
virtual int32_t getNbLayers() const noexcept = 0;
virtual ILayer* getLayer(int32_t index) const noexcept = 0;
@@ -1008,16 +990,15 @@ class VNetworkDefinition : public VRoot
virtual IGatherLayer* addGather(ITensor& data, ITensor& indices, int32_t axis) noexcept = 0;
virtual IRaggedSoftMaxLayer* addRaggedSoftMax(ITensor& input, ITensor& bounds) noexcept = 0;
virtual IMatrixMultiplyLayer* addMatrixMultiply(
- ITensor& input0, MatrixOperation op0, ITensor& input1, MatrixOperation op1) noexcept
- = 0;
- virtual IConstantLayer* addConstant(Dims dimensions, Weights weights) noexcept = 0;
- virtual IRNNv2Layer* addRNNv2(
- ITensor& input, int32_t layerCount, int32_t hiddenSize, int32_t maxSeqLen, RNNOperation op) noexcept = 0;
+ ITensor& input0, MatrixOperation op0, ITensor& input1, MatrixOperation op1) noexcept = 0;
+ virtual IConstantLayer* addConstant(Dims const& dimensions, Weights weights) noexcept = 0;
virtual IIdentityLayer* addIdentity(ITensor& input) noexcept = 0;
virtual void removeTensor(ITensor& tensor) noexcept = 0;
virtual void unmarkOutput(ITensor& tensor) noexcept = 0;
virtual IPluginV2Layer* addPluginV2(ITensor* const* inputs, int32_t nbInputs, IPluginV2& plugin) noexcept = 0;
- virtual ISliceLayer* addSlice(ITensor& input, Dims start, Dims size, Dims stride) noexcept = 0;
+ virtual IPluginV3Layer* addPluginV3(ITensor* const* inputs, int32_t nbInputs, ITensor* const* shapeInputs,
+ int32_t nbShapeInputs, IPluginV3& plugin) noexcept = 0;
+ virtual ISliceLayer* addSlice(ITensor& input, Dims const& start, Dims const& size, Dims const& stride) noexcept = 0;
virtual void setName(char const* name) noexcept = 0;
virtual char const* getName() const noexcept = 0;
virtual IShapeLayer* addShape(ITensor& input) noexcept = 0;
@@ -1026,21 +1007,19 @@ class VNetworkDefinition : public VRoot
virtual bool unmarkOutputForShapes(ITensor& tensor) noexcept = 0;
virtual IParametricReLULayer* addParametricReLU(ITensor& input, ITensor& slope) noexcept = 0;
virtual IConvolutionLayer* addConvolutionNd(
- ITensor& input, int32_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ ITensor& input, int64_t nbOutputMaps, Dims const& kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
= 0;
- virtual IPoolingLayer* addPoolingNd(ITensor& input, PoolingType type, Dims windowSize) noexcept = 0;
+ virtual IPoolingLayer* addPoolingNd(ITensor& input, PoolingType type, Dims const& windowSize) noexcept = 0;
virtual IDeconvolutionLayer* addDeconvolutionNd(
- ITensor& input, int32_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ ITensor& input, int64_t nbOutputMaps, Dims const& kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
= 0;
virtual IScaleLayer* addScaleNd(
- ITensor& input, ScaleMode mode, Weights shift, Weights scale, Weights power, int32_t channelAxis) noexcept
- = 0;
+ ITensor& input, ScaleMode mode, Weights shift, Weights scale, Weights power, int32_t channelAxis) noexcept = 0;
virtual IResizeLayer* addResize(ITensor& input) noexcept = 0;
- virtual bool hasExplicitPrecision() const noexcept = 0;
virtual ILoop* addLoop() noexcept = 0;
virtual ISelectLayer* addSelect(ITensor& condition, ITensor& thenInput, ITensor& elseInput) noexcept = 0;
- virtual IFillLayer* addFill(Dims dimensions, FillOperation op) noexcept = 0;
- virtual IPaddingLayer* addPaddingNd(ITensor& input, Dims prePadding, Dims postPadding) noexcept = 0;
+ virtual IFillLayer* addFill(Dims const& dimensions, FillOperation op) noexcept = 0;
+ virtual IPaddingLayer* addPaddingNd(ITensor& input, Dims const& prePadding, Dims const& postPadding) noexcept = 0;
virtual bool setWeightsName(Weights weights, char const* name) noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
@@ -1060,12 +1039,19 @@ class VNetworkDefinition : public VRoot
ITensor& input, ITensor& scale, ITensor& bias, uint32_t axesMask) noexcept = 0;
virtual ICastLayer* addCast(ITensor& input, DataType toType) noexcept = 0;
virtual IBuilder& getBuilder() const noexcept = 0;
+ virtual NetworkDefinitionCreationFlags getFlags() const noexcept = 0;
+ virtual bool getFlag(NetworkDefinitionCreationFlag networkDefinitionCreationFlag) const noexcept = 0;
+ virtual IQuantizeLayer* addQuantizeV2(ITensor& input, ITensor& scale, DataType outputType) noexcept = 0;
+ virtual IDequantizeLayer* addDequantizeV2(ITensor& input, ITensor& scale, DataType outputType) noexcept = 0;
+ virtual IFillLayer* addFillV2(Dims const& dimensions, FillOperation op, DataType outputType) noexcept = 0;
+ virtual bool markDebug(ITensor& tensor) noexcept = 0;
+ virtual bool unmarkDebug(ITensor& tensor) noexcept = 0;
+ virtual bool isDebugTensor(nvinfer1::ITensor const& tensor) const noexcept = 0;
};
class VAlgorithmIOInfo : public VRoot
{
public:
- virtual TensorFormat getTensorFormat() const noexcept = 0;
virtual DataType getDataType() const noexcept = 0;
virtual Dims getStrides() const noexcept = 0;
virtual int64_t getVectorizedDim() const noexcept = 0;
@@ -1091,7 +1077,6 @@ class VAlgorithmContext : public VRoot
class VAlgorithm : public VRoot
{
public:
- virtual IAlgorithmIOInfo const& getAlgorithmIOInfo(int32_t index) const noexcept = 0;
virtual IAlgorithmVariant const& getAlgorithmVariant() const noexcept = 0;
virtual float getTimingMSec() const noexcept = 0;
virtual std::size_t getWorkspaceSize() const noexcept = 0;
@@ -1109,16 +1094,12 @@ class VTimingCache : public VRoot
class VBuilderConfig : public VRoot
{
public:
- virtual void setMinTimingIterations(int32_t minTiming) noexcept = 0;
- virtual int32_t getMinTimingIterations() const noexcept = 0;
virtual void setAvgTimingIterations(int32_t avgTiming) noexcept = 0;
virtual int32_t getAvgTimingIterations() const noexcept = 0;
virtual void setEngineCapability(EngineCapability capability) noexcept = 0;
virtual EngineCapability getEngineCapability() const noexcept = 0;
virtual void setInt8Calibrator(IInt8Calibrator* calibrator) noexcept = 0;
virtual IInt8Calibrator* getInt8Calibrator() const noexcept = 0;
- virtual void setMaxWorkspaceSize(std::size_t workspaceSize) noexcept = 0;
- virtual std::size_t getMaxWorkspaceSize() const noexcept = 0;
virtual void setFlags(BuilderFlags builderFlags) noexcept = 0;
virtual BuilderFlags getFlags() const noexcept = 0;
virtual void clearFlag(BuilderFlag builderFlag) noexcept = 0;
@@ -1167,21 +1148,29 @@ class VBuilderConfig : public VRoot
virtual int32_t getNbPluginsToSerialize() const noexcept = 0;
virtual void setMaxAuxStreams(int32_t nbStreams) noexcept = 0;
virtual int32_t getMaxAuxStreams() const noexcept = 0;
+ virtual void setProgressMonitor(IProgressMonitor* monitor) noexcept = 0;
+ virtual IProgressMonitor* getProgressMonitor() const noexcept = 0;
+};
+
+class VSerializationConfig : public VRoot
+{
+public:
+ virtual bool setFlags(SerializationFlags serializationFlags) noexcept = 0;
+ virtual SerializationFlags getFlags() const noexcept = 0;
+ virtual bool clearFlag(SerializationFlag serializationFlag) noexcept = 0;
+ virtual bool setFlag(SerializationFlag serializationFlag) noexcept = 0;
+ virtual bool getFlag(SerializationFlag serializationFlag) const noexcept = 0;
};
class VBuilder : public VRoot
{
public:
- virtual void setMaxBatchSize(int32_t batchSize) noexcept = 0;
- virtual int32_t getMaxBatchSize() const noexcept = 0;
virtual bool platformHasFastFp16() const noexcept = 0;
virtual bool platformHasFastInt8() const noexcept = 0;
virtual int32_t getMaxDLABatchSize() const noexcept = 0;
virtual int32_t getNbDLACores() const noexcept = 0;
virtual void setGpuAllocator(IGpuAllocator* allocator) noexcept = 0;
virtual nvinfer1::IBuilderConfig* createBuilderConfig() noexcept = 0;
- virtual nvinfer1::ICudaEngine* buildEngineWithConfig(INetworkDefinition& network, IBuilderConfig& config) noexcept
- = 0;
virtual nvinfer1::INetworkDefinition* createNetworkV2(NetworkDefinitionCreationFlags flags) noexcept = 0;
virtual nvinfer1::IOptimizationProfile* createOptimizationProfile() noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
diff --git a/include/NvInferLegacyDims.h b/include/NvInferLegacyDims.h
index 9c757043..204d17a8 100644
--- a/include/NvInferLegacyDims.h
+++ b/include/NvInferLegacyDims.h
@@ -36,6 +36,7 @@ namespace nvinfer1
{
//!
//! \class Dims2
+//!
//! \brief Descriptor for two-dimensional data.
//!
class Dims2 : public Dims
@@ -55,12 +56,12 @@ class Dims2 : public Dims
//! \param d0 The first element.
//! \param d1 The second element.
//!
- Dims2(int32_t d0, int32_t d1)
+ Dims2(int64_t d0, int64_t d1)
{
nbDims = 2;
d[0] = d0;
d[1] = d1;
- for (int32_t i{nbDims}; i < Dims::MAX_DIMS; ++i)
+ for (int64_t i{nbDims}; i < Dims::MAX_DIMS; ++i)
{
d[i] = 0;
}
@@ -69,6 +70,7 @@ class Dims2 : public Dims
//!
//! \class DimsHW
+//!
//! \brief Descriptor for two-dimensional spatial data.
//!
class DimsHW : public Dims2
@@ -88,7 +90,7 @@ class DimsHW : public Dims2
//! \param height the height of the data
//! \param width the width of the data
//!
- DimsHW(int32_t height, int32_t width)
+ DimsHW(int64_t height, int64_t width)
: Dims2(height, width)
{
}
@@ -98,7 +100,7 @@ class DimsHW : public Dims2
//!
//! \return The height.
//!
- int32_t& h()
+ int64_t& h()
{
return d[0];
}
@@ -108,7 +110,7 @@ class DimsHW : public Dims2
//!
//! \return The height.
//!
- int32_t h() const
+ int64_t h() const
{
return d[0];
}
@@ -118,7 +120,7 @@ class DimsHW : public Dims2
//!
//! \return The width.
//!
- int32_t& w()
+ int64_t& w()
{
return d[1];
}
@@ -128,7 +130,7 @@ class DimsHW : public Dims2
//!
//! \return The width.
//!
- int32_t w() const
+ int64_t w() const
{
return d[1];
}
@@ -136,6 +138,7 @@ class DimsHW : public Dims2
//!
//! \class Dims3
+//!
//! \brief Descriptor for three-dimensional data.
//!
class Dims3 : public Dims2
@@ -156,7 +159,7 @@ class Dims3 : public Dims2
//! \param d1 The second element.
//! \param d2 The third element.
//!
- Dims3(int32_t d0, int32_t d1, int32_t d2)
+ Dims3(int64_t d0, int64_t d1, int64_t d2)
: Dims2(d0, d1)
{
nbDims = 3;
@@ -166,6 +169,7 @@ class Dims3 : public Dims2
//!
//! \class Dims4
+//!
//! \brief Descriptor for four-dimensional data.
//!
class Dims4 : public Dims3
@@ -187,7 +191,7 @@ class Dims4 : public Dims3
//! \param d2 The third element.
//! \param d3 The fourth element.
//!
- Dims4(int32_t d0, int32_t d1, int32_t d2, int32_t d3)
+ Dims4(int64_t d0, int64_t d1, int64_t d2, int64_t d3)
: Dims3(d0, d1, d2)
{
nbDims = 4;
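
The hunks above widen the `Dims2`/`DimsHW`/`Dims3`/`Dims4` constructors and accessors from `int32_t` to `int64_t`. A minimal usage sketch, assuming only that the TensorRT 10 headers are on the include path; the values are illustrative:

```cpp
#include "NvInferLegacyDims.h"

#include <cstdint>
#include <iostream>

int main()
{
    // Dimensions are now 64-bit, so values above INT32_MAX are representable.
    nvinfer1::DimsHW hw{int64_t{1} << 32, 1024};
    nvinfer1::Dims4 nchw{1, 3, hw.h(), hw.w()};

    std::cout << "h=" << hw.h() << " w=" << hw.w() << " nbDims=" << nchw.nbDims << "\n";
    return 0;
}
```
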
diff --git a/include/NvInferPluginUtils.h b/include/NvInferPluginUtils.h
index c501f8e5..bfc924e5 100644
--- a/include/NvInferPluginUtils.h
+++ b/include/NvInferPluginUtils.h
@@ -33,142 +33,118 @@ namespace plugin
{
//!
-//! \brief The Permute plugin layer permutes the input tensor by changing the memory order of the data.
-//! Quadruple defines a structure that contains an array of 4 integers. They can represent the permute orders or the
-//! strides in each dimension.
-//!
-typedef struct
-{
- int32_t data[4];
-} Quadruple;
-
+//! \struct PriorBoxParameters
//!
//! \brief The PriorBox plugin layer generates the prior boxes of designated sizes and aspect ratios across all
-//! dimensions (H x W). PriorBoxParameters defines a set of parameters for creating the PriorBox plugin layer. It
-//! contains:
-//! \param minSize Minimum box size in pixels. Can not be nullptr.
-//! \param maxSize Maximum box size in pixels. Can be nullptr.
-//! \param aspectRatios Aspect ratios of the boxes. Can be nullptr.
-//! \param numMinSize Number of elements in minSize. Must be larger than 0.
-//! \param numMaxSize Number of elements in maxSize. Can be 0 or same as numMinSize.
-//! \param numAspectRatios Number of elements in aspectRatios. Can be 0.
-//! \param flip If true, will flip each aspect ratio. For example, if there is an aspect ratio "r", the aspect ratio
-//! "1.0/r" will be generated as well.
-//! \param clip If true, will clip the prior so that it is within [0,1].
-//! \param variance Variance for adjusting the prior boxes.
-//! \param imgH Image height. If 0, then the H dimension of the data tensor will be used.
-//! \param imgW Image width. If 0, then the W dimension of the data tensor will be used.
-//! \param stepH Step in H. If 0, then (float)imgH/h will be used where h is the H dimension of the 1st input tensor.
-//! \param stepW Step in W. If 0, then (float)imgW/w will be used where w is the W dimension of the 1st input tensor.
-//! \param offset Offset to the top left corner of each cell.
+//! dimensions (H x W).
+//!
+//! PriorBoxParameters defines a set of parameters for creating the PriorBox plugin layer.
//!
struct PriorBoxParameters
{
- float *minSize, *maxSize, *aspectRatios;
- int32_t numMinSize, numMaxSize, numAspectRatios;
- bool flip;
- bool clip;
- float variance[4];
- int32_t imgH, imgW;
- float stepH, stepW;
- float offset;
+ float *minSize; //!< Minimum box size in pixels. Can not be nullptr.
+ float *maxSize; //!< Maximum box size in pixels. Can be nullptr.
+ float *aspectRatios; //!< Aspect ratios of the boxes. Can be nullptr.
+ int32_t numMinSize; //!< Number of elements in minSize. Must be larger than 0.
+ int32_t numMaxSize; //!< Number of elements in maxSize. Can be 0 or same as numMinSize.
+ int32_t numAspectRatios; //!< Number of elements in aspectRatios. Can be 0.
+ bool flip; //!< If true, will flip each aspect ratio. For example,
+ //!< if there is an aspect ratio "r", the aspect ratio "1.0/r" will be generated as well.
+ bool clip; //!< If true, will clip the prior so that it is within [0,1].
+ float variance[4]; //!< Variance for adjusting the prior boxes.
+ int32_t imgH; //!< Image height. If 0, then the H dimension of the data tensor will be used.
+ int32_t imgW; //!< Image width. If 0, then the W dimension of the data tensor will be used.
+ float stepH; //!< Step in H. If 0, then (float)imgH/h will be used where h is the H dimension of the 1st input tensor.
+ float stepW; //!< Step in W. If 0, then (float)imgW/w will be used where w is the W dimension of the 1st input tensor.
+ float offset; //!< Offset to the top left corner of each cell.
};
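
For orientation, a hedged sketch of populating `PriorBoxParameters` according to the field comments above. The helper function and all values are made up, and creating the actual plugin through the plugin registry is not shown:

```cpp
#include "NvInferPluginUtils.h"

// Illustrative helper (not part of TensorRT): builds a PriorBoxParameters with
// one minimum size, no maximum sizes, and two aspect ratios.
nvinfer1::plugin::PriorBoxParameters makePriorBoxParams(float* minSizes, float* aspectRatios)
{
    nvinfer1::plugin::PriorBoxParameters p{};
    p.minSize = minSizes;          // must not be nullptr
    p.maxSize = nullptr;           // optional
    p.aspectRatios = aspectRatios; // optional
    p.numMinSize = 1;
    p.numMaxSize = 0;
    p.numAspectRatios = 2;
    p.flip = true;                 // also generate 1/r for each aspect ratio r
    p.clip = true;                 // clip priors to [0,1]
    p.variance[0] = 0.1F;
    p.variance[1] = 0.1F;
    p.variance[2] = 0.2F;
    p.variance[3] = 0.2F;
    p.imgH = 0;                    // 0: take H from the data tensor
    p.imgW = 0;                    // 0: take W from the data tensor
    p.stepH = 0.0F;                // 0: derive as imgH / feature-map H
    p.stepW = 0.0F;
    p.offset = 0.5F;
    return p;
}
```
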
+//!
+//! \struct RPROIParams
//!
//! \brief RPROIParams is used to create the RPROIPlugin instance.
-//! It contains:
-//! \param poolingH Height of the output in pixels after ROI pooling on feature map.
-//! \param poolingW Width of the output in pixels after ROI pooling on feature map.
-//! \param featureStride Feature stride; ratio of input image size to feature map size. Assuming that max pooling layers
-//! in the neural network use square filters.
-//! \param preNmsTop Number of proposals to keep before applying NMS.
-//! \param nmsMaxOut Number of remaining proposals after applying NMS.
-//! \param anchorsRatioCount Number of anchor box ratios.
-//! \param anchorsScaleCount Number of anchor box scales.
-//! \param iouThreshold IoU (Intersection over Union) threshold used for the NMS step.
-//! \param minBoxSize Minimum allowed bounding box size before scaling, used for anchor box calculation.
-//! \param spatialScale Spatial scale between the input image and the last feature map.
//!
struct RPROIParams
{
- int32_t poolingH;
- int32_t poolingW;
- int32_t featureStride;
- int32_t preNmsTop;
- int32_t nmsMaxOut;
- int32_t anchorsRatioCount;
- int32_t anchorsScaleCount;
- float iouThreshold;
- float minBoxSize;
- float spatialScale;
+ int32_t poolingH; //!< Height of the output in pixels after ROI pooling on feature map.
+ int32_t poolingW; //!< Width of the output in pixels after ROI pooling on feature map.
+ int32_t featureStride; //!< Feature stride; ratio of input image size to feature map size.
+ //!< Assuming that max pooling layers in the neural network use square filters.
+ int32_t preNmsTop; //!< Number of proposals to keep before applying NMS.
+ int32_t nmsMaxOut; //!< Number of remaining proposals after applying NMS.
+ int32_t anchorsRatioCount; //!< Number of anchor box ratios.
+ int32_t anchorsScaleCount; //!< Number of anchor box scales.
+ float iouThreshold; //!< IoU (Intersection over Union) threshold used for the NMS step.
+ float minBoxSize; //!< Minimum allowed bounding box size before scaling, used for anchor box calculation.
+ float spatialScale; //!< Spatial scale between the input image and the last feature map.
};
-
+//!
+//! \struct GridAnchorParameters
//!
//! \brief The Anchor Generator plugin layer generates the prior boxes of designated sizes and aspect ratios across all dimensions (H x W).
//! GridAnchorParameters defines a set of parameters for creating the plugin layer for all feature maps.
-//! It contains:
-//! \param minScale Scale of anchors corresponding to finest resolution.
-//! \param maxScale Scale of anchors corresponding to coarsest resolution.
-//! \param aspectRatios List of aspect ratios to place on each grid point.
-//! \param numAspectRatios Number of elements in aspectRatios.
-//! \param H Height of feature map to generate anchors for.
-//! \param W Width of feature map to generate anchors for.
-//! \param variance Variance for adjusting the prior boxes.
//!
struct GridAnchorParameters
{
- float minSize, maxSize;
- float* aspectRatios;
- int32_t numAspectRatios, H, W;
- float variance[4];
+ float minSize; //!< Scale of anchors corresponding to finest resolution.
+ float maxSize; //!< Scale of anchors corresponding to coarsest resolution.
+ float* aspectRatios; //!< List of aspect ratios to place on each grid point.
+ int32_t numAspectRatios; //!< Number of elements in aspectRatios.
+ int32_t H; //!< Height of feature map to generate anchors for.
+ int32_t W; //!< Width of feature map to generate anchors for.
+ float variance[4]; //!< Variance for adjusting the prior boxes.
};
//!
//! \enum CodeTypeSSD
+//!
//! \brief The type of encoding used for decoding the bounding boxes and loc_data.
//!
+//! \deprecated Deprecated in TensorRT 10.0. DetectionOutput plugin is deprecated.
+//!
enum class CodeTypeSSD : int32_t
{
- CORNER = 0, //!< Use box corners.
- CENTER_SIZE = 1, //!< Use box centers and size.
- CORNER_SIZE = 2, //!< Use box centers and size.
- TF_CENTER = 3 //!< Use box centers and size but flip x and y coordinates.
+ CORNER TRT_DEPRECATED_ENUM = 0, //!< Use box corners.
+ CENTER_SIZE TRT_DEPRECATED_ENUM = 1, //!< Use box centers and size.
+ CORNER_SIZE TRT_DEPRECATED_ENUM = 2, //!< Use box centers and size.
+ TF_CENTER TRT_DEPRECATED_ENUM = 3 //!< Use box centers and size but flip x and y coordinates.
};
//!
-//! \brief The DetectionOutput plugin layer generates the detection output based on location and confidence predictions by doing non maximum suppression.
-//! This plugin first decodes the bounding boxes based on the anchors generated. It then performs non_max_suppression on the decoded bounding boxes.
+//! \struct DetectionOutputParameters
+//!
+//! \brief The DetectionOutput plugin layer generates the detection output
+//! based on location and confidence predictions by doing non maximum suppression.
+//!
+//! This plugin first decodes the bounding boxes based on the anchors generated.
+//! It then performs non_max_suppression on the decoded bounding boxes.
//! DetectionOutputParameters defines a set of parameters for creating the DetectionOutput plugin layer.
-//! It contains:
-//! \param shareLocation If true, bounding box are shared among different classes.
-//! \param varianceEncodedInTarget If true, variance is encoded in target. Otherwise we need to adjust the predicted offset accordingly.
-//! \param backgroundLabelId Background label ID. If there is no background class, set it as -1.
-//! \param numClasses Number of classes to be predicted.
-//! \param topK Number of boxes per image with top confidence scores that are fed into the NMS algorithm.
-//! \param keepTopK Number of total bounding boxes to be kept per image after NMS step.
-//! \param confidenceThreshold Only consider detections whose confidences are larger than a threshold.
-//! \param nmsThreshold Threshold to be used in NMS.
-//! \param codeType Type of coding method for bbox.
-//! \param inputOrder Specifies the order of inputs {loc_data, conf_data, priorbox_data}.
-//! \param confSigmoid Set to true to calculate sigmoid of confidence scores.
-//! \param isNormalized Set to true if bounding box data is normalized by the network.
-//! \param isBatchAgnostic Defaults to true. Set to false if prior boxes are unique per batch
-//!
-struct DetectionOutputParameters
+//!
+//! \deprecated Deprecated in TensorRT 10.0. DetectionOutput plugin is deprecated.
+//!
+struct TRT_DEPRECATED DetectionOutputParameters
{
- bool shareLocation, varianceEncodedInTarget;
- int32_t backgroundLabelId, numClasses, topK, keepTopK;
- float confidenceThreshold, nmsThreshold;
- CodeTypeSSD codeType;
- int32_t inputOrder[3];
- bool confSigmoid;
- bool isNormalized;
- bool isBatchAgnostic{true};
+    bool shareLocation;           //!< If true, bounding boxes are shared among different classes.
+ bool varianceEncodedInTarget; //!< If true, variance is encoded in target.
+ //!< Otherwise we need to adjust the predicted offset accordingly.
+ int32_t backgroundLabelId; //!< Background label ID. If there is no background class, set it as -1.
+ int32_t numClasses; //!< Number of classes to be predicted.
+ int32_t topK; //!< Number of boxes per image with top confidence scores that are fed
+ //!< into the NMS algorithm.
+ int32_t keepTopK; //!< Number of total bounding boxes to be kept per image after NMS step.
+ float confidenceThreshold; //!< Only consider detections whose confidences are larger than a threshold.
+ float nmsThreshold; //!< Threshold to be used in NMS.
+ CodeTypeSSD codeType; //!< Type of coding method for bbox.
+ int32_t inputOrder[3]; //!< Specifies the order of inputs {loc_data, conf_data, priorbox_data}.
+ bool confSigmoid; //!< Set to true to calculate sigmoid of confidence scores.
+ bool isNormalized; //!< Set to true if bounding box data is normalized by the network.
+ bool isBatchAgnostic{true}; //!< Defaults to true. Set to false if prior boxes are unique per batch.
};
//!
-//! \brief When performing yolo9000, softmaxTree is helping to do softmax on confidence scores, for element to get the precise classification through word-tree structured classification definition.
+//! \brief When performing yolo9000, softmaxTree helps to perform softmax on confidence scores,
+//! so that each element obtains a precise classification through the word-tree structured classification definition.
//!
struct softmaxTree
{
@@ -178,53 +154,48 @@ struct softmaxTree
int32_t* child;
int32_t* group;
char** name;
-
int32_t groups;
int32_t* groupSize;
int32_t* groupOffset;
};
//!
-//! \brief The Region plugin layer performs region proposal calculation: generate 5 bounding boxes per cell (for yolo9000, generate 3 bounding boxes per cell).
-//! For each box, calculating its probablities of objects detections from 80 pre-defined classifications (yolo9000 has 9418 pre-defined classifications,
-//! and these 9418 items are organized as work-tree structure).
+//! \brief The Region plugin layer performs region proposal calculation.
+//!
+//! Generate 5 bounding boxes per cell (for yolo9000, generate 3 bounding boxes per cell).
+//! For each box, it calculates the probabilities of object detection from 80 pre-defined classifications
+//! (yolo9000 has 9418 pre-defined classifications, and these 9418 items are organized as a word-tree structure).
//! RegionParameters defines a set of parameters for creating the Region plugin layer.
-//! \param num Number of predicted bounding box for each grid cell.
-//! \param coords Number of coordinates for a bounding box.
-//! \param classes Number of classifications to be predicted.
-//! \param smTree Helping structure to do softmax on confidence scores.
//!
struct RegionParameters
{
- int32_t num;
- int32_t coords;
- int32_t classes;
- softmaxTree* smTree;
+ int32_t num; //!< Number of predicted bounding box for each grid cell.
+ int32_t coords; //!< Number of coordinates for a bounding box.
+ int32_t classes; //!< Number of classifications to be predicted.
+ softmaxTree* smTree; //!< Helping structure to do softmax on confidence scores.
};
//!
//! \brief The NMSParameters are used by the BatchedNMSPlugin for performing
//! the non_max_suppression operation over boxes for object detection networks.
-//! \param shareLocation If set to true, the boxes inputs are shared across all
-//! classes. If set to false, the boxes input should account for per class box data.
-//! \param backgroundLabelId Label ID for the background class. If there is no background class, set it as -1
-//! \param numClasses Number of classes in the network.
-//! \param topK Number of bounding boxes to be fed into the NMS step.
-//! \param keepTopK Number of total bounding boxes to be kept per image after NMS step.
-//! Should be less than or equal to the topK value.
-//! \param scoreThreshold Scalar threshold for score (low scoring boxes are removed).
-//! \param iouThreshold scalar threshold for IOU (new boxes that have high IOU overlap
-//! with previously selected boxes are removed).
-//! \param isNormalized Set to false, if the box coordinates are not
-//! normalized, i.e. not in the range [0,1]. Defaults to false.
//!
-
-struct NMSParameters
+//! \deprecated Deprecated in TensorRT 10.0. BatchedNMSPlugin plugin is deprecated.
+//!
+struct TRT_DEPRECATED NMSParameters
{
- bool shareLocation;
- int32_t backgroundLabelId, numClasses, topK, keepTopK;
- float scoreThreshold, iouThreshold;
- bool isNormalized;
+ bool shareLocation; //!< If set to true, the boxes inputs are shared across all classes.
+ //!< If set to false, the boxes input should account for per class box data.
+ int32_t backgroundLabelId; //!< Label ID for the background class.
+ //!< If there is no background class, set it as -1
+ int32_t numClasses; //!< Number of classes in the network.
+ int32_t topK; //!< Number of bounding boxes to be fed into the NMS step.
+ int32_t keepTopK; //!< Number of total bounding boxes to be kept per image after NMS step.
+ //!< Should be less than or equal to the topK value.
+ float scoreThreshold; //!< Scalar threshold for score (low scoring boxes are removed).
+ float iouThreshold; //!< A scalar threshold for IOU (new boxes that have high IOU overlap
+ //!< with previously selected boxes are removed).
+ bool isNormalized; //!< Set to false, if the box coordinates are not normalized,
+ //!< i.e. not in the range [0,1]. Defaults to false.
};
} // namespace plugin
diff --git a/include/NvInferRuntime.h b/include/NvInferRuntime.h
index 925531e0..04434931 100644
--- a/include/NvInferRuntime.h
+++ b/include/NvInferRuntime.h
@@ -69,7 +69,6 @@ class INoCopy
//! network operations that are DLA compatible and the resulting serialized engine can be executed using standalone
//! DLA runtime APIs. See sampleCudla for an example of integrating cuDLA APIs with TensorRT APIs.
//!
-
enum class EngineCapability : int32_t
{
//!
@@ -78,9 +77,6 @@ enum class EngineCapability : int32_t
//!
kSTANDARD = 0,
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kSTANDARD.
- kDEFAULT TRT_DEPRECATED_ENUM = kSTANDARD,
-
//!
//! Safety: TensorRT flow with restrictions targeting the safety runtime.
//! See safety documentation for list of supported layers and formats.
@@ -89,18 +85,12 @@ enum class EngineCapability : int32_t
//! This flag is only supported in NVIDIA Drive(R) products.
kSAFETY = 1,
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kSAFETY.
- kSAFE_GPU TRT_DEPRECATED_ENUM = kSAFETY,
-
//!
//! DLA Standalone: TensorRT flow with restrictions targeting external, to TensorRT, DLA runtimes.
//! See DLA documentation for list of supported layers and formats.
//! This flow supports only DeviceType::kDLA.
//!
kDLA_STANDALONE = 2,
-
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kDLA_STANDALONE.
- kSAFE_DLA TRT_DEPRECATED_ENUM = kDLA_STANDALONE,
};
namespace impl
@@ -167,17 +157,6 @@ class IHostMemory : public INoCopy
{
return mImpl->type();
}
- //!
- //! Destroy the allocated memory.
- //!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
protected:
apiv::VHostMemory* mImpl;
@@ -215,6 +194,7 @@ constexpr inline int32_t EnumMax() noexcept
//!
//! \enum TensorLocation
+//!
//! \brief The location for tensor data storage, device or host.
//!
enum class TensorLocation : int32_t
@@ -236,27 +216,33 @@ struct EnumMaxImpl
//!
//! \class IDimensionExpr
//!
-//! An IDimensionExpr represents an integer expression constructed from constants,
+//! \brief An IDimensionExpr represents an integer expression constructed from constants,
//! input dimensions, and binary operations. These expressions are can be used
-//! in overrides of IPluginV2DynamicExt::getOutputDimensions to define output
+//! in overrides of IPluginV2DynamicExt::getOutputDimensions or IPluginV3OneBuild::getOutputShapes() to define output
//! dimensions in terms of input dimensions.
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
-//! \see DimensionOperation, IPluginV2DynamicExt::getOutputDimensions
+//! \see DimensionOperation, IPluginV2DynamicExt::getOutputDimensions, IPluginV3OneBuild::getOutputShapes()
//!
class IDimensionExpr : public INoCopy
{
public:
- //! Return true if expression is a build-time constant.
+ //!
+ //! \brief Return true if expression is a build-time constant.
+ //!
bool isConstant() const noexcept
{
return mImpl->isConstant();
}
+ //!
+ //! \brief Get the value of the constant.
+ //!
//! If isConstant(), returns value of the constant.
-    //! If !isConstant(), return std::numeric_limits<int32_t>::min().
-    int32_t getConstantValue() const noexcept
+    //! If !isConstant(), return std::numeric_limits<int64_t>::min().
+ //!
+ int64_t getConstantValue() const noexcept
{
return mImpl->getConstantValue();
}
@@ -264,20 +250,31 @@ class IDimensionExpr : public INoCopy
protected:
apiv::VDimensionExpr* mImpl;
virtual ~IDimensionExpr() noexcept = default;
+
+public:
+ //!
+ //! \brief Return true if this denotes the value of a size tensor.
+ //!
+ //! \return True if this was created with method IExprBuilder::declareSizeTensor, false otherwise
+ //!
+ bool isSizeTensor() const noexcept
+ {
+ return mImpl->isSizeTensor();
+ }
};
//!
//! \class IExprBuilder
//!
-//! Object for constructing IDimensionExpr.
+//! \brief Object for constructing IDimensionExpr.
//!
//! There is no public way to construct an IExprBuilder. It appears as an argument to
-//! method IPluginV2DynamicExt::getOutputDimensions(). Overrides of that method can use
-//! that IExprBuilder argument to construct expressions that define output dimensions
-//! in terms of input dimensions.
+//! method IPluginV2DynamicExt::getOutputDimensions() and IPluginV3OneBuild::getOutputShapes(). Overrides of that
+//! method can use that IExprBuilder argument to construct expressions that define output dimensions in terms of input
+//! dimensions.
//!
//! Clients should assume that any values constructed by the IExprBuilder are destroyed
-//! after IPluginV2DynamicExt::getOutputDimensions() returns.
+//! after IPluginV2DynamicExt::getOutputDimensions() or IPluginV3OneBuild::getOutputShapes() returns.
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
@@ -286,14 +283,20 @@ class IDimensionExpr : public INoCopy
class IExprBuilder : public INoCopy
{
public:
- //! Return pointer to IDimensionExp for given value.
- IDimensionExpr const* constant(int32_t value) noexcept
+ //!
+    //! \brief Return pointer to IDimensionExpr for given value.
+ //!
+ IDimensionExpr const* constant(int64_t value) noexcept
{
return mImpl->constant(value);
}
+ //!
+ //! \brief Get the operation.
+ //!
    //! Return pointer to IDimensionExpr that represents the given operation applied to first and second.
//! Returns nullptr if op is not a valid DimensionOperation.
+ //!
IDimensionExpr const* operation(
DimensionOperation op, IDimensionExpr const& first, IDimensionExpr const& second) noexcept
{
@@ -303,12 +306,42 @@ class IExprBuilder : public INoCopy
protected:
apiv::VExprBuilder* mImpl;
virtual ~IExprBuilder() noexcept = default;
+
+public:
+ //!
+ //! \brief Declare a size tensor at the given output index, with the specified auto-tuning formula and upper bound.
+ //!
+ //! A size tensor allows a plugin to have output dimensions that cannot be computed solely from input dimensions.
+ //! For example, suppose a plugin implements the equivalent of INonZeroLayer for 2D input. The plugin can
+ //! have one output for the indices of non-zero elements, and a second output containing the number of non-zero
+ //! elements. Suppose the input has size [M,N] and has K non-zero elements. The plugin can write K to the second
+    //! output. When telling TensorRT that the first output has shape [2,K], the plugin uses IExprBuilder::constant() and
+    //! IExprBuilder::declareSizeTensor(1,...) to create the IDimensionExprs that denote 2 and K, respectively.
+    //!
+    //! TensorRT also needs to know the value of K to use for auto-tuning and an upper bound on K so that it can
+    //! allocate memory for the output tensor. In the example, suppose typically half of the plugin's input elements
+    //! are non-zero, and all the elements might be non-zero. Then using M*N/2 might be a good expression for the opt
+    //! parameter, and M*N for the upper bound. IDimensionExpr objects for these expressions can be constructed from
+    //! the IDimensionExpr objects for the input dimensions.
+ //!
+ //! \param outputIndex index of a plugin output that is a size tensor.
+ //! \param opt formula for computing auto-tuning value. Must not depend on a size tensor.
+ //! \param upper Upper bound on the size tensor.
+ //!
+ //! \return IDimensionExpr denoting the value of the size tensor.
+ //!
+ //! \see IPluginV3OneBuild::getOutputShapes()
+ //!
+ IDimensionExpr const* declareSizeTensor(int32_t outputIndex, IDimensionExpr const& opt, IDimensionExpr const& upper)
+ {
+ return mImpl->declareSizeTensor(outputIndex, opt, upper);
+ }
};
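
The `declareSizeTensor()` description above (a NonZero-style plugin whose first output has a data-dependent shape) corresponds roughly to a `getOutputShapes()` override like the sketch below. Only the TensorRT types are real; the free-standing function is a stand-in for the member override, and error handling is minimal:

```cpp
#include "NvInferRuntime.h"

using namespace nvinfer1;

// Sketch of IPluginV3OneBuild::getOutputShapes() for a 2D NonZero-like plugin:
// output 0 holds the [2, K] indices, output 1 is a 0-D size tensor holding K.
int32_t getOutputShapesSketch(DimsExprs const* inputs, int32_t nbInputs, DimsExprs const* /*shapeInputs*/,
    int32_t /*nbShapeInputs*/, DimsExprs* outputs, int32_t nbOutputs, IExprBuilder& exprBuilder) noexcept
{
    if (nbInputs != 1 || nbOutputs != 2 || inputs[0].nbDims != 2)
    {
        return -1;
    }
    // M * N: total number of input elements.
    IDimensionExpr const* numElems
        = exprBuilder.operation(DimensionOperation::kPROD, *inputs[0].d[0], *inputs[0].d[1]);
    // Auto-tuning value: assume roughly half of the elements are non-zero.
    IDimensionExpr const* opt
        = exprBuilder.operation(DimensionOperation::kFLOOR_DIV, *numElems, *exprBuilder.constant(2));

    // Output 1 is the size tensor; it is declared 0-D and its value K bounds output 0.
    IDimensionExpr const* k = exprBuilder.declareSizeTensor(1, *opt, *numElems);
    outputs[1].nbDims = 0;

    // Output 0 has shape [2, K].
    outputs[0].nbDims = 2;
    outputs[0].d[0] = exprBuilder.constant(2);
    outputs[0].d[1] = k;
    return 0;
}
```
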
//!
//! \class DimsExprs
//!
-//! Analog of class Dims with expressions instead of constants for the dimensions.
+//! \brief Analog of class Dims with expressions instead of constants for the dimensions.
//!
class DimsExprs
{
@@ -318,9 +351,9 @@ class DimsExprs
};
//!
-//! \class DynamicPluginTensorDesc
+//! \struct DynamicPluginTensorDesc
//!
-//! Summarizes tensors that a plugin might see for an input or output.
+//! \brief Summarizes tensors that a plugin might see for an input or output.
//!
struct DynamicPluginTensorDesc
{
@@ -332,27 +365,42 @@ struct DynamicPluginTensorDesc
//! Upper bounds on tensor’s dimensions
Dims max;
+
+ //! Optimum value of tensor’s dimensions specified for auto-tuning
+ Dims opt;
};
//!
//! \class IPluginV2DynamicExt
//!
-//! Similar to IPluginV2Ext, but with support for dynamic shapes.
+//! \brief Similar to IPluginV2Ext, but with support for dynamic shapes.
//!
//! Clients should override the public methods, including the following inherited methods:
//!
-//! virtual int32_t getNbOutputs() const noexcept = 0;
-//! virtual nvinfer1::DataType getOutputDataType(int32_t index, nvinfer1::DataType const* inputTypes, int32_t
-//! nbInputs) const noexcept = 0; virtual size_t getSerializationSize() const noexcept = 0; virtual void
-//! serialize(void* buffer) const noexcept = 0; virtual void destroy() noexcept = 0; virtual void
-//! setPluginNamespace(char const* pluginNamespace) noexcept = 0; virtual char const* getPluginNamespace() const
-//! noexcept = 0;
+//! * virtual int32_t getNbOutputs() const noexcept = 0;
+//!
+//! * virtual DataType getOutputDataType(int32_t index, DataType const* inputTypes,
+//! int32_t nbInputs) const noexcept = 0;
+//!
+//! * virtual size_t getSerializationSize() const noexcept = 0;
//!
-//! For getOutputDataType, the inputTypes will always be DataType::kFLOAT or DataType::kINT32,
+//! * virtual void serialize(void* buffer) const noexcept = 0;
+//!
+//! * virtual void destroy() noexcept = 0;
+//!
+//! * virtual void setPluginNamespace(char const* pluginNamespace) noexcept = 0;
+//!
+//! * virtual char const* getPluginNamespace() const noexcept = 0;
+//!
+//! For weakly typed networks, the inputTypes will always be DataType::kFLOAT or DataType::kINT32,
//! and the returned type is canonicalized to DataType::kFLOAT if it is DataType::kHALF or DataType:kINT8.
+//! For strongly typed networks, inputTypes are inferred from previous operations, and getOutputDataType
+//! specifies the returned type based on the inputTypes.
//! Details about the floating-point precision are elicited later by method supportsFormatCombination.
//!
-class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
+//! \deprecated Deprecated in TensorRT 10.0. Please implement IPluginV3 instead.
+//!
+class TRT_DEPRECATED IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
{
public:
IPluginV2DynamicExt* clone() const noexcept override = 0;
@@ -385,7 +433,7 @@ class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
int32_t outputIndex, DimsExprs const* inputs, int32_t nbInputs, IExprBuilder& exprBuilder) noexcept = 0;
//!
- //! Limit on number of format combinations accepted.
+ //! \brief Limit on number of format combinations accepted.
//!
static constexpr int32_t kFORMAT_COMBINATION_LIMIT = 100;
@@ -406,18 +454,18 @@ class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
//!
//! * A definition for a plugin that supports only FP16 NCHW:
//!
- //! return inOut.format[pos] == TensorFormat::kLINEAR && inOut.type[pos] == DataType::kHALF;
+ //! return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kHALF;
//!
//! * A definition for a plugin that supports only FP16 NCHW for its two inputs,
//! and FP32 NCHW for its single output:
//!
- //! return inOut.format[pos] == TensorFormat::kLINEAR && (inOut.type[pos] == (pos < 2 ? DataType::kHALF :
+ //! return inOut[pos].format == TensorFormat::kLINEAR && (inOut[pos].type == (pos < 2 ? DataType::kHALF :
//! DataType::kFLOAT));
//!
//! * A definition for a "polymorphic" plugin with two inputs and one output that supports
//! any format or type, but the inputs and output must have the same format and type:
//!
- //! return pos == 0 || (inOut.format[pos] == inOut.format[0] && inOut.type[pos] == inOut.type[0]);
+ //! return pos == 0 || (inOut[pos].format == inOut.format[0] && inOut[pos].type == inOut[0].type);
//!
//! Warning: TensorRT will stop asking for formats once it finds kFORMAT_COMBINATION_LIMIT on combinations.
//!
@@ -450,9 +498,8 @@ class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
//! * IExecutionContext will call this during the next subsequent instance enqueue[V2]() or execute[V2]() if:
//! - The batch size is changed from previous call of execute()/enqueue() if hasImplicitBatchDimension() returns
//! true.
- //! - The optimization profile is changed via setOptimizationProfile() or setOptimizationProfileAsync().
- //! - An input shape binding is changed via setInputShapeBinding().
- //! - An input execution binding is changed via setBindingDimensions().
+ //! - The optimization profile is changed via setOptimizationProfileAsync().
+ //! - An input execution binding is changed via setInputShape().
//! \warning The execution phase is timing critical during IExecutionContext but is not part of the timing loop when
//! called from IBuilder. Performance bottlenecks of configurePlugin won't show up during engine building but will
//! be visible during execution after calling functions that trigger layer resource updates.
@@ -510,53 +557,644 @@ class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
private:
// Following are obsolete base class methods, and must not be implemented or used.
+ //!
+ //! \brief Set plugin configuration
+ //!
void configurePlugin(Dims const*, int32_t, Dims const*, int32_t, DataType const*, DataType const*, bool const*,
bool const*, PluginFormat, int32_t) noexcept override final
{
}
+ //!
+ //! \brief Check if provided data type is supported
+ //!
bool supportsFormat(DataType, PluginFormat) const noexcept override final
{
return false;
}
+ //!
+ //! \brief Get output dimensions.
+ //!
Dims getOutputDimensions(int32_t, Dims const*, int32_t) noexcept override final
{
return Dims{-1, {}};
}
- bool isOutputBroadcastAcrossBatch(int32_t, bool const*, int32_t) const noexcept override final
+ //!
+    //! \brief Whether the output is broadcast across the batch.
+ //!
+ //! \warning Expected to return false as implicit batch support was removed in TensorRT 10.0.
+ //!
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool isOutputBroadcastAcrossBatch(int32_t, bool const*, int32_t) const noexcept override final
{
return false;
}
- bool canBroadcastInputAcrossBatch(int32_t) const noexcept override final
+ //!
+    //! \brief Whether an input can be broadcast across the batch.
+ //!
+ //! \warning Expected to return false as implicit batch support was removed in TensorRT 10.0.
+ //!
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool canBroadcastInputAcrossBatch(int32_t) const noexcept override final
{
return true;
}
- size_t getWorkspaceSize(int32_t) const noexcept override final
- {
- return 0;
- }
+ //!
+ //! \brief Get required workspace size in bytes.
+ //!
+ size_t getWorkspaceSize(int32_t) const noexcept override final
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Run inference.
+ //!
+ int32_t enqueue(int32_t, void const* const*, void* const*, void*, cudaStream_t) noexcept override final
+ {
+ return 1;
+ }
+};
+
+//!
+//! \class IPluginResourceContext
+//!
+//! \brief Interface for plugins to access per context resources provided by TensorRT
+//!
+//! There is no public way to construct an IPluginResourceContext. It appears as an argument to
+//! IPluginV3OneRuntime::attachToContext(). Overrides of that method can use the IPluginResourceContext object to access
+//! any available per context resources.
+//!
+//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+//!
+//! \see IPluginV3OneRuntime::attachToContext()
+//!
+class IPluginResourceContext
+{
+public:
+ //! \brief Get the GPU allocator associated with the resource context
+ //!
+ //! \see IPluginV3OneRuntime::attachToContext()
+ //!
+ virtual IGpuAllocator* getGpuAllocator() const noexcept = 0;
+
+ //! \brief Get the error recorder associated with the resource context
+ //!
+ //! \see IPluginV3OneRuntime::attachToContext()
+ //!
+ virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
+ virtual ~IPluginResourceContext() noexcept = default;
+
+protected:
+ IPluginResourceContext() = default;
+ IPluginResourceContext(IPluginResourceContext const&) = default;
+ IPluginResourceContext(IPluginResourceContext&&) = default;
+ IPluginResourceContext& operator=(IPluginResourceContext const&) & = default;
+ IPluginResourceContext& operator=(IPluginResourceContext&&) & = default;
+};
+
+namespace v_1_0
+{
+class IPluginCapability : public IVersionedInterface
+{
+};
+} // namespace v_1_0
+
+//!
+//! \class IPluginCapability
+//!
+//! \brief Base class for plugin capability interfaces
+//!
+//! IPluginCapability represents a split in TensorRT V3 plugins to sub-objects that expose different types of
+//! capabilities a plugin may have, as opposed to a single interface which defines all capabilities and behaviors of a
+//! plugin.
+//!
+//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+//!
+//! \see PluginCapabilityType
+//!
+using IPluginCapability = v_1_0::IPluginCapability;
+
+namespace v_1_0
+{
+class IPluginV3 : public IVersionedInterface
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN", 1, 0};
+ }
+
+ //! \brief Return a pointer to plugin object implementing the specified PluginCapabilityType.
+ //!
+ //! \note IPluginV3 objects added for the build phase (through addPluginV3()) must return valid objects for
+ //! PluginCapabilityType::kCORE, PluginCapabilityType::kBUILD and PluginCapabilityType::kRUNTIME.
+ //!
+ //! \note IPluginV3 objects added for the runtime phase must return valid objects for
+ //! PluginCapabilityType::kCORE and PluginCapabilityType::kRUNTIME.
+ //!
+ //! \see TensorRTPhase
+ //! \see IPluginCreatorV3One::createPlugin()
+ //!
+ virtual IPluginCapability* getCapabilityInterface(PluginCapabilityType type) noexcept = 0;
+
+ //!
+ //! \brief Clone the plugin object. This copies over internal plugin parameters and returns a new plugin object with
+ //! these parameters. The cloned object must be in a fully initialized state.
+ //!
+ //! \note The cloned object must return valid objects through getCapabilityInterface() for at least the same
+ //! PluginCapabilityTypes as the original object.
+ //!
+ //! \return A cloned plugin object in an initialized state with the same parameters as the current object.
+ //! nullptr must be returned if the cloning fails.
+ //!
+ virtual IPluginV3* clone() noexcept = 0;
+};
+
+} // namespace v_1_0
+
+//!
+//! \class IPluginV3
+//!
+//! \brief Plugin class for the V3 generation of user-implemented layers.
+//!
+//! IPluginV3 acts as a wrapper around the plugin capability interfaces that define the actual behavior of the plugin.
+//!
+//! \see IPluginCapability
+//! \see IPluginCreatorV3One
+//! \see IPluginRegistry
+//!
+using IPluginV3 = v_1_0::IPluginV3;
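
A frequently used (though not required) layout is a single class that derives from `IPluginV3` and the three capability interfaces and hands out itself from `getCapabilityInterface()`. A hedged sketch with everything except the dispatch elided; the class name is hypothetical, and the remaining pure-virtual methods would still have to be implemented:

```cpp
#include "NvInferRuntime.h"

using namespace nvinfer1;

// Fragment of a hypothetical plugin; the IPluginV3One* overrides are omitted,
// so this illustrates only the capability-dispatch pattern.
class MyPlugin : public IPluginV3, public IPluginV3OneCore, public IPluginV3OneBuild, public IPluginV3OneRuntime
{
public:
    IPluginCapability* getCapabilityInterface(PluginCapabilityType type) noexcept override
    {
        switch (type)
        {
        case PluginCapabilityType::kCORE: return static_cast<IPluginV3OneCore*>(this);
        case PluginCapabilityType::kBUILD: return static_cast<IPluginV3OneBuild*>(this);
        case PluginCapabilityType::kRUNTIME: return static_cast<IPluginV3OneRuntime*>(this);
        }
        return nullptr;
    }

    // clone() and the core/build/runtime capability methods are implemented here
    // in a real plugin; they are left out of this sketch.
};
```
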
+
+namespace v_1_0
+{
+class IPluginV3OneCore : public IPluginCapability
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN_V3ONE_CORE", 1, 0};
+ }
+
+ //!
+ //! \brief Return the plugin name. Should match the plugin name returned by the corresponding plugin creator.
+ //!
+ //! \see IPluginCreatorV3One::getPluginName()
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginName() const noexcept = 0;
+
+ //!
+ //! \brief Return the plugin version. Should match the plugin version returned by the corresponding plugin creator.
+ //!
+ //! \see IPluginCreatorV3One::getPluginVersion()
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginVersion() const noexcept = 0;
+
+ //!
+ //! \brief Return the namespace of the plugin object. Should match the plugin namespace returned by the
+ //! corresponding plugin creator.
+ //!
+ //! \see IPluginCreatorV3One::getPluginNamespace()
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginNamespace() const noexcept = 0;
+};
+
+class IPluginV3OneBuild : public IPluginCapability
+{
+public:
+ //!
+ //! \brief The default maximum number of format combinations that will be timed by TensorRT during the build phase
+ //!
+ //! \see getFormatCombinationLimit
+ //!
+ static constexpr int32_t kDEFAULT_FORMAT_COMBINATION_LIMIT = 100;
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN_V3ONE_BUILD", 1, 0};
+ }
+
+ //!
+ //! \brief Configure the plugin.
+ //!
+ //! configurePlugin() can be called multiple times in the build phase during creation of an engine by IBuilder.
+ //!
+ //! configurePlugin() is called when a plugin is being prepared for profiling but not for any
+ //! specific input size. This provides an opportunity for the plugin to make algorithmic choices on the basis of
+ //! input and output formats, along with the bound of possible dimensions. The min, opt and max value of the
+ //! DynamicPluginTensorDesc correspond to the kMIN, kOPT and kMAX value of the current profile that the plugin is
+ //! being profiled for, with the desc.dims field corresponding to the dimensions of plugin specified at network
+ //! creation. Wildcard dimensions may exist during this phase in the desc.dims field.
+ //!
+ //! \param in The input tensors attributes that are used for configuration.
+ //! \param nbInputs Number of input tensors.
+ //! \param out The output tensors attributes that are used for configuration.
+ //! \param nbOutputs Number of output tensors.
+ //!
+ virtual int32_t configurePlugin(DynamicPluginTensorDesc const* in, int32_t nbInputs,
+ DynamicPluginTensorDesc const* out, int32_t nbOutputs) noexcept = 0;
+
+ //!
+ //! \brief Provide the data types of the plugin outputs if the input tensors have the data types provided.
+ //!
+ //! \param outputTypes Pre-allocated array to which the output data types should be written.
+ //! \param nbOutputs The number of output tensors. This matches the value returned from getNbOutputs().
+ //! \param inputTypes The input data types.
+ //! \param nbInputs The number of input tensors.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). The returned code will be reported
+ //! through the error recorder.
+ //!
+ //! \note Provide `DataType::kFLOAT`s if the layer has no inputs. The data type for any size tensor outputs must be
+ //! `DataType::kINT32`. The returned data types must each have a format that is supported by the plugin.
+ //!
+ //! \warning DataType:kBOOL and DataType::kUINT8 are not supported.
+ //!
+ virtual int32_t getOutputDataTypes(
+ DataType* outputTypes, int32_t nbOutputs, const DataType* inputTypes, int32_t nbInputs) const noexcept = 0;
+
+ //!
+ //! \brief Provide expressions for computing dimensions of the output tensors from dimensions of the input tensors.
+ //!
+ //! \param inputs Expressions for dimensions of the input tensors
+ //! \param nbInputs The number of input tensors
+ //! \param shapeInputs Expressions for values of the shape tensor inputs
+ //! \param nbShapeInputs The number of shape tensor inputs
+ //! \param outputs Pre-allocated array to which the output dimensions must be written
+ //! \param exprBuilder Object for generating new dimension expressions
+ //!
+ //! \note Any size tensor outputs must be declared to be 0-D.
+ //!
+ //! \note The declaration of shapeInputs as DimsExprs is slightly abusive, because the "dimensions"
+ //! are actually the values of the shape tensor. For example, if the input shape tensor
+ //! is a 2x3 matrix, the DimsExprs will have six "dimensions": the three values from the first
+ //! row of the matrix followed by the three values from the second row of the matrix.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). Returned code will be reported
+ //! through the error recorder.
+ //!
+ virtual int32_t getOutputShapes(DimsExprs const* inputs, int32_t nbInputs, DimsExprs const* shapeInputs,
+ int32_t nbShapeInputs, DimsExprs* outputs, int32_t nbOutputs, IExprBuilder& exprBuilder) noexcept = 0;
+
+ //!
+ //! \brief Return true if plugin supports the format and datatype for the input/output indexed by pos.
+ //!
+ //! For this method inputs are numbered 0.. (nbInputs - 1) and outputs are numbered nbInputs.. (nbInputs + nbOutputs
+ //! - 1). Using this numbering, pos is an index into InOut, where 0 <= pos < nbInputs + nbOutputs - 1.
+ //!
+ //! TensorRT invokes this method to ask if the input/output indexed by pos supports the format/datatype specified
+ //! by inOut[pos].format and inOut[pos].type. The override should return true if that format/datatype at inOut[pos]
+ //! are supported by the plugin. If support is conditional on other input/output formats/datatypes, the plugin can
+ //! make its result conditional on the formats/datatypes in inOut[0.. pos - 1], which will be set to values
+    //! that the plugin supports. The override should not inspect inOut[pos + 1.. nbInputs + nbOutputs - 1],
+ //! which will have invalid values. In other words, the decision for pos must be based on inOut[0..pos] only.
+ //!
+ //! Some examples:
+ //!
+ //! * A definition for a plugin that supports only FP16 NCHW:
+ //!
+ //! return inOut.format[pos] == TensorFormat::kLINEAR && inOut.type[pos] == DataType::kHALF;
+ //!
+ //! * A definition for a plugin that supports only FP16 NCHW for its two inputs,
+ //! and FP32 NCHW for its single output:
+ //!
+    //!         return inOut.format[pos] == TensorFormat::kLINEAR && (inOut.type[pos] == (pos < 2 ? DataType::kHALF :
+    //!         DataType::kFLOAT));
+ //!
+ //! * A definition for a "polymorphic" plugin with two inputs and one output that supports
+ //! any format or type, but the inputs and output must have the same format and type:
+ //!
+ //! return pos == 0 || (inOut.format[pos] == inOut.format[0] && inOut.type[pos] == inOut.type[0]);
+ //!
+ //! \warning TensorRT will stop querying once it finds getFormatCombinationLimit() of combinations.
+ //!
+ //! \see getFormatCombinationLimit
+ //!
+ virtual bool supportsFormatCombination(
+ int32_t pos, DynamicPluginTensorDesc const* inOut, int32_t nbInputs, int32_t nbOutputs) noexcept = 0;
+
+ //!
+ //! \brief Get the number of outputs from the plugin.
+ //!
+ //! \return The number of outputs, which must be a positive integer.
+ //!
+ virtual int32_t getNbOutputs() const noexcept = 0;
+
+ //!
+ //! \brief Find the workspace size required by the layer.
+ //!
+ //! This function is called after the plugin is configured, and possibly during execution.
+ //! The result should be a sufficient workspace size to deal with inputs and outputs of the given size
+ //! or any smaller problem.
+ //!
+ //! \return The workspace size.
+ //!
+ virtual size_t getWorkspaceSize(DynamicPluginTensorDesc const* inputs, int32_t nbInputs,
+ DynamicPluginTensorDesc const* outputs, int32_t nbOutputs) const noexcept
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Query for any custom tactics that the plugin intends to use
+ //!
+ //! For each format combination supported by the plugin (up to a maximum indicated by getFormatCombinationLimit()),
+ //! the plugin will be timed for each tactic advertised through this method.
+ //!
+ //! \param tactics Pre-allocated buffer to which the tactic values should be written
+ //! \param nbTactics The number of tactics advertised through getNbTactics()
+ //!
+ //! \note The provided tactic values must be unique and non-zero. The tactic value 0 is reserved for the default
+ //! tactic attached to each format combination.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). The returned code will be reported
+ //! through the error recorder.
+ //!
+ virtual int32_t getValidTactics(int32_t* tactics, int32_t nbTactics) noexcept
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Query for the number of custom tactics the plugin intends to use
+ //!
+ virtual int32_t getNbTactics() noexcept
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Called to query the suffix to use for the timing cache ID. May be called anytime after plugin creation.
+ //!
+ //! \return Suffix to use for timing cache ID, considering only the creation state of the plugin.
+ //! Returning nullptr will disable timing caching for the plugin altogether.
+ //!
+ //! \note If timing caching is enabled for the plugin (by returning non-null), the I/O shape and format information
+ //! will be automatically considered to form the prefix of the timing cache ID. Therefore, only other factors
+ //! determining the creation state of the plugin, such as its attribute values, should be considered to compose the
+ //! return value.
+ //!
+ virtual char const* getTimingCacheID() noexcept
+ {
+ return nullptr;
+ }
+
+ //!
+ //! \brief Return the maximum number of format combinations that will be timed by TensorRT during the build phase
+ //!
+ virtual int32_t getFormatCombinationLimit() noexcept
+ {
+ return kDEFAULT_FORMAT_COMBINATION_LIMIT;
+ }
+
+ //!
+ //! \brief Query for a string representing the configuration of the plugin. May be called anytime after
+ //! plugin creation.
+ //!
+ //! \return A string representing the plugin's creation state, especially with regard to its attribute values.
+ //!
+ virtual char const* getMetadataString() noexcept
+ {
+ return nullptr;
+ }
+};
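
Note that this `supportsFormatCombination()` receives `DynamicPluginTensorDesc` (unlike the `IPluginV2DynamicExt` variant shown earlier, which receives `PluginTensorDesc`), so the per-position type and format are reached through the `desc` member. A hedged sketch of the FP16, linear-format case from the examples above; the free-standing function stands in for the member override:

```cpp
#include "NvInferRuntime.h"

using namespace nvinfer1;

// Sketch of an IPluginV3OneBuild::supportsFormatCombination() override for a
// plugin that accepts only FP16 tensors in linear (NCHW) layout at every position.
bool supportsFp16LinearOnly(
    int32_t pos, DynamicPluginTensorDesc const* inOut, int32_t /*nbInputs*/, int32_t /*nbOutputs*/) noexcept
{
    PluginTensorDesc const& desc = inOut[pos].desc;
    return desc.format == TensorFormat::kLINEAR && desc.type == DataType::kHALF;
}
```
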
+
+class IPluginV3OneRuntime : public IPluginCapability
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN_V3ONE_RUNTIME", 1, 0};
+ }
+
+ //!
+ //! \brief Set the tactic to be used in the subsequent call to enqueue(). If no custom tactics were advertised, this
+ //! will have a value of 0, which is designated as the default tactic.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). The returned code will be reported
+ //! through the error recorder.
+ //!
+ virtual int32_t setTactic(int32_t tactic) noexcept
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Called when a plugin is being prepared for execution for specific dimensions. This could
+ //! happen multiple times in the execution phase, both during creation of an engine by IBuilder and execution of an
+ //! engine by IExecutionContext.
+ //! * IBuilder will call this function once per profile, with `in` resolved to the values specified by the
+ //! kOPT field of the current profile.
+ //! * IExecutionContext will call this during the next subsequent instance of enqueueV3() or executeV2() if:
+    //!   - The optimization profile is changed via setOptimizationProfileAsync().
+ //! - An input binding is changed via setInputTensorAddress() or setTensorAddress() or setInputShape().
+ //! \warning The execution phase is timing critical during IExecutionContext but is not part of the timing loop when
+ //! called from IBuilder. Performance bottlenecks of onShapeChange() will not show up during engine building but
+ //! will be visible during execution if any triggering functions are called.
+ //!
+ //! \param in The input tensors attributes that are used for configuration.
+ //! \param nbInputs Number of input tensors.
+ //! \param out The output tensors attributes that are used for configuration.
+ //! \param nbOutputs Number of output tensors.
+ //!
+ virtual int32_t onShapeChange(
+ PluginTensorDesc const* in, int32_t nbInputs, PluginTensorDesc const* out, int32_t nbOutputs) noexcept = 0;
+
+ //!
+ //! \brief Execute the layer.
+ //!
+ //! \param inputDesc how to interpret the memory for the input tensors.
+ //! \param outputDesc how to interpret the memory for the output tensors.
+ //! \param inputs The memory for the input tensors.
+ //! \param outputs The memory for the output tensors.
+ //! \param workspace Workspace for execution.
+ //! \param stream The stream in which to execute the kernels.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). The returned code will be reported
+ //! through the error recorder.
+ //!
+ virtual int32_t enqueue(PluginTensorDesc const* inputDesc, PluginTensorDesc const* outputDesc,
+ void const* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) noexcept = 0;
+
+ //!
+    //! \brief Clone the plugin, attach the cloned plugin object to an execution context and grant the cloned plugin
+ //! access to some context resources.
+ //!
+ //! This function is called automatically for each plugin when a new execution context is created. The plugin may
+ //! use resources provided by the IPluginResourceContext until the plugin is deleted by TensorRT.
+ //!
+    //! If the plugin needs per-context resources, they can be allocated here.
+ //!
+ //! \param context A resource context that exposes methods to get access to execution context specific resources.
+ //! A different resource context is guaranteed for each different execution context to which the
+ //! plugin is attached.
+ //! \see IPluginResourceContext
+ //!
+ //! \note This method should clone the entire IPluginV3 object, not just the runtime interface
+ //!
+    //! \return A clone of the IPluginV3 object on whose runtime interface this method was invoked, attached to the
+    //! provided resource context.
+ //!
+ virtual IPluginV3* attachToContext(IPluginResourceContext* context) noexcept = 0;
+
+ //!
+ //! \brief Get the plugin fields which should be serialized.
+ //!
+ //! \note The set of plugin fields returned does not necessarily need to match that advertised through
+ //! getFieldNames() of the corresponding plugin creator.
+
+ //! \note To serialize arbitrary plugin data, use a PluginField of
+ //! PluginFieldType::kUNKNOWN, with the length of the PluginField set to the correct number of bytes.
+ //!
+ virtual PluginFieldCollection const* getFieldsToSerialize() noexcept = 0;
+};
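
`getFieldsToSerialize()` is how a V3 plugin persists its state: it returns a `PluginFieldCollection` that TensorRT stores with the engine and later hands back to the creator in the runtime phase. A minimal sketch, assuming a single float attribute named "alpha"; the class and member names are hypothetical:

```cpp
#include "NvInferRuntime.h"

#include <vector>

using namespace nvinfer1;

// Fragment of a hypothetical plugin showing only the serialization hook of
// IPluginV3OneRuntime.
class MyPluginSerializationPart
{
public:
    PluginFieldCollection const* getFieldsToSerialize() noexcept
    {
        mFields.clear();
        mFields.emplace_back("alpha", &mAlpha, PluginFieldType::kFLOAT32, 1);
        mFC.nbFields = static_cast<int32_t>(mFields.size());
        mFC.fields = mFields.data();
        return &mFC;
    }

private:
    float mAlpha{1.0F};               // hypothetical plugin attribute
    std::vector<PluginField> mFields; // storage backing the returned collection
    PluginFieldCollection mFC{};
};
```
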
+} // namespace v_1_0
+
+//!
+//! \class IPluginV3OneCore
+//!
+//! \brief A plugin capability interface that enables the core capability (PluginCapabilityType::kCORE).
+//!
+//! \see IPluginCapability
+//! \see PluginCapabilityType
+//! \see IPluginV3::getCapabilityInterface()
+//!
+using IPluginV3OneCore = v_1_0::IPluginV3OneCore;
+
+//!
+//! \class IPluginV3OneBuild
+//!
+//! \brief A plugin capability interface that enables the build capability (PluginCapabilityType::kBUILD). Exposes
+//! methods that allow the expression of the build time properties and behavior of a plugin.
+//!
+//! \see IPluginCapability
+//! \see PluginCapabilityType
+//! \see IPluginV3::getCapabilityInterface()
+//!
+using IPluginV3OneBuild = v_1_0::IPluginV3OneBuild;
+
+//!
+//! \class IPluginV3OneRuntime
+//!
+//! \brief A plugin capability interface that enables the runtime capability (PluginCapabilityType::kRUNTIME). Exposes
+//! methods that allow the expression of the runtime properties and behavior of a plugin.
+//!
+//! \see IPluginCapability
+//! \see PluginCapabilityType
+//! \see IPluginV3::getCapabilityInterface()
+//!
+using IPluginV3OneRuntime = v_1_0::IPluginV3OneRuntime;
+
+namespace v_1_0
+{
+class IPluginCreatorV3One : public IPluginCreatorInterface
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN CREATOR_V3ONE", 1, 0};
+ }
+
+ //!
+ //! \brief Return a plugin object. Return nullptr in case of error.
+ //!
+ //! \param name A NULL-terminated name string of length 1024 or less, including the NULL terminator.
+ //! \param fc A pointer to a collection of fields needed for constructing the plugin.
+ //! \param phase The TensorRT phase in which the plugin is being created
+ //!
+ //! When the phase is TensorRTPhase::kRUNTIME, the PluginFieldCollection provided for serialization by the plugin's
+ //! runtime interface will be passed as fc.
+ //!
+ //! \note The returned plugin object must be in an initialized state
+ //!
+ virtual IPluginV3* createPlugin(
+ AsciiChar const* name, PluginFieldCollection const* fc, TensorRTPhase phase) noexcept = 0;
+
+ //!
+ //! \brief Return a list of fields that need to be passed to createPlugin() when creating a plugin for use in the
+ //! TensorRT build phase.
+ //!
+ //! \see PluginFieldCollection
+ //!
+ virtual PluginFieldCollection const* getFieldNames() noexcept = 0;
+
+ //!
+ //! \brief Return the plugin name.
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginName() const noexcept = 0;
+
+ //!
+ //! \brief Return the plugin version.
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginVersion() const noexcept = 0;
+
+ //!
+ //! \brief Return the plugin namespace.
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginNamespace() const noexcept = 0;
+
+ IPluginCreatorV3One() = default;
+ virtual ~IPluginCreatorV3One() = default;
- int32_t enqueue(int32_t, void const* const*, void* const*, void*, cudaStream_t) noexcept override final
- {
- return 1;
- }
+protected:
+ IPluginCreatorV3One(IPluginCreatorV3One const&) = default;
+ IPluginCreatorV3One(IPluginCreatorV3One&&) = default;
+ IPluginCreatorV3One& operator=(IPluginCreatorV3One const&) & = default;
+ IPluginCreatorV3One& operator=(IPluginCreatorV3One&&) & = default;
};
+} // namespace v_1_0
//!
-//! \class IProfiler
-//!
-//! \brief Application-implemented interface for profiling.
+//! \class IPluginCreatorV3One
//!
-//! When this class is added to an execution context, the profiler will be called once per layer for each invocation of
-//! executeV2()/enqueueV2()/enqueueV3().
+//! \brief A plugin creator class capable of producing IPluginV3 objects
//!
-//! It is not recommended to run inference with profiler enabled when the inference execution time is critical since the
-//! profiler may affect execution time negatively.
+//! \see IPluginV3
+//! \see IPluginRegistry
//!
+using IPluginCreatorV3One = v_1_0::IPluginCreatorV3One;
+
+namespace v_1_0
+{
class IProfiler
{
public:
@@ -571,17 +1209,32 @@ class IProfiler
virtual ~IProfiler() noexcept {}
};
+} // namespace v_1_0
+
+//!
+//! \class IProfiler
+//!
+//! \brief Application-implemented interface for profiling.
+//!
+//! When this class is added to an execution context, the profiler will be called once per layer for each invocation of
+//! executeV2()/enqueueV3().
+//!
+//! It is not recommended to run inference with profiler enabled when the inference execution time is critical since the
+//! profiler may affect execution time negatively.
+//!
+using IProfiler = v_1_0::IProfiler;
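+
+// A minimal profiler sketch, assuming the reportLayerTime callback declared in the elided portion of the
+// IProfiler class above; SimpleProfiler and context are illustrative names:
+//
+//     struct SimpleProfiler : public nvinfer1::IProfiler
+//     {
+//         void reportLayerTime(char const* layerName, float ms) noexcept override
+//         {
+//             std::printf("%s: %.3f ms\n", layerName, ms);
+//         }
+//     };
+//
+//     SimpleProfiler profiler;
+//     context->setProfiler(&profiler); // called once per layer for each executeV2()/enqueueV3()
+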
//!
//! \enum WeightsRole
+//!
//! \brief How a layer uses particular Weights.
//!
//! The power weights of an IScaleLayer are omitted. Refitting those is not supported.
//!
enum class WeightsRole : int32_t
{
- kKERNEL = 0, //!< kernel for IConvolutionLayer, IDeconvolutionLayer, or IFullyConnectedLayer
- kBIAS = 1, //!< bias for IConvolutionLayer, IDeconvolutionLayer, or IFullyConnectedLayer
+ kKERNEL = 0, //!< kernel for IConvolutionLayer or IDeconvolutionLayer
+ kBIAS = 1, //!< bias for IConvolutionLayer or IDeconvolutionLayer
kSHIFT = 2, //!< shift part of IScaleLayer
kSCALE = 3, //!< scale part of IScaleLayer
kCONSTANT = 4, //!< weights for IConstantLayer
@@ -602,8 +1255,8 @@ constexpr inline int32_t EnumMax() noexcept
//!
enum class DeviceType : int32_t
{
- kGPU, //!< GPU Device
- kDLA, //!< DLA Core
+ kGPU = 0, //!< GPU Device
+ kDLA = 1, //!< DLA Core
};
//! Maximum number of elements in DeviceType enum. \see DeviceType
@@ -641,6 +1294,7 @@ constexpr inline int32_t EnumMax() noexcept
return 2;
}
+//!
//! \brief Represents a collection of one or more TempfileControlFlag values combined using bitwise-OR operations.
//!
//! \see TempfileControlFlag,
@@ -660,29 +1314,9 @@ class IRuntime : public INoCopy
public:
virtual ~IRuntime() noexcept = default;
- //!
- //! \brief Deserialize an engine from a stream.
- //!
- //! If an error recorder has been set for the runtime, it will also be passed to the engine.
- //!
- //! \param blob The memory that holds the serialized engine.
- //! \param size The size of the memory in bytes.
- //! \param pluginFactory The plugin factory, if any plugins are used by the network, otherwise nullptr.
- //!
- //! \return The engine, or nullptr if it could not be deserialized.
- //!
- //! \deprecated Deprecated in TensorRT 8.0.
- //!
- //! \warning IPluginFactory is no longer supported, therefore pluginFactory must be a nullptr.
- //!
- TRT_DEPRECATED nvinfer1::ICudaEngine* deserializeCudaEngine(
- void const* blob, std::size_t size, IPluginFactory* pluginFactory) noexcept
- {
- return mImpl->deserializeCudaEngine(blob, size, nullptr);
- }
-
//!
//! \brief Sets the DLA core used by the network. Defaults to -1.
+ //!
//! \param dlaCore The DLA core to execute the engine on, in the range [0,getNbDlaCores()).
//!
//! This function is used to specify which DLA core to use via indexing, if multiple DLA cores are available.
@@ -698,6 +1332,7 @@ class IRuntime : public INoCopy
//!
//! \brief Get the DLA core that the engine executes on.
+ //!
//! \return assigned DLA core or -1 for DLA not present or unset.
//!
int32_t getDLACore() const noexcept
@@ -713,20 +1348,9 @@ class IRuntime : public INoCopy
return mImpl->getNbDLACores();
}
- //!
- //! \brief Destroy this object.
- //!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Set the GPU allocator.
+ //!
//! \param allocator Set the GPU allocator to be used by the runtime. All GPU memory acquired will use this
//! allocator. If NULL is passed, the default allocator will be used.
//!
@@ -774,7 +1398,7 @@ class IRuntime : public INoCopy
}
//!
- //! \brief Deserialize an engine from a stream.
+ //! \brief Deserialize an engine from host memory.
//!
//! If an error recorder has been set for the runtime, it will also be passed to the engine.
//!
@@ -785,7 +1409,25 @@ class IRuntime : public INoCopy
//!
ICudaEngine* deserializeCudaEngine(void const* blob, std::size_t size) noexcept
{
- return mImpl->deserializeCudaEngine(blob, size, nullptr);
+ return mImpl->deserializeCudaEngine(blob, size);
+ }
+
+ //!
+ //! \brief Deserialize an engine from a stream.
+ //!
+ //! If an error recorder has been set for the runtime, it will also be passed to the
+ //! engine.
+ //!
+ //! This deserialization path will reduce host memory usage when weight streaming is enabled.
+ //!
+ //! \param streamReader a read-only stream from which TensorRT will deserialize a
+ //! previously serialized engine.
+ //!
+ //! \return The engine, or nullptr if it could not be deserialized.
+ //!
+ ICudaEngine* deserializeCudaEngine(IStreamReader& streamReader)
+ {
+ return mImpl->deserializeCudaEngine(streamReader);
}
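+
+ // Sketch of the stream-reader path above (MyFileReader is a hypothetical IStreamReader implementation
+ // whose read() pulls bytes from a file; runtime is an IRuntime created elsewhere):
+ //
+ //     MyFileReader reader{"model.engine"};
+ //     nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(reader);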
//!
@@ -800,6 +1442,7 @@ class IRuntime : public INoCopy
//!
//! \brief Set the maximum number of threads.
+ //!
//! \param maxThreads The maximum number of threads that can be used by the runtime.
//! \return True if successful, false otherwise.
//!
@@ -973,9 +1616,11 @@ class IRefitter : public INoCopy
//!
//! * There is no such layer by that name.
//! * The layer does not have weights with the specified role.
- //! * The number of weights is inconsistent with the layer’s original specification.
+ //! * The count of weights is inconsistent with the layer’s original specification.
+ //! * The type of weights is inconsistent with the layer’s original specification.
//!
- //! Modifying the weights before method refit() completes will result in undefined behavior.
+ //! Modifying the weights before method refitCudaEngine or refitCudaEngineAsync returns will result in undefined
+ //! behavior.
//!
//! \warning The string layerName must be null-terminated, and be at most 4096 bytes including the terminator.
//!
@@ -985,14 +1630,16 @@ class IRefitter : public INoCopy
}
//!
- //! \brief Updates associated engine. Return true if successful.
+ //! \brief Refits associated engine.
//!
- //! Failure occurs if getMissing() != 0 before the call.
+ //! \return True on success, or false if new weights validation fails or getMissingWeights() != 0 before the call.
+ //! If false is returned, a subset of weights may have been refitted.
//!
//! The behavior is undefined if the engine has pending enqueued work.
+ //! Provided weights on CPU or GPU can be unset and released, or updated after refitCudaEngine returns.
//!
- //! Extant IExecutionContexts associated with the engine should not be used afterwards.
- //! Instead, create new IExecutionContexts after refitting.
+ //! IExecutionContexts associated with the engine remain valid for use afterwards. There is no need to set the same
+ //! weights repeatedly for multiple refit calls as the weights memory can be updated directly instead.
//!
bool refitCudaEngine() noexcept
{
@@ -1037,16 +1684,6 @@ class IRefitter : public INoCopy
return mImpl->getAll(size, layerNames, roles);
}
- //!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! Update dynamic range for a tensor.
//!
@@ -1155,9 +1792,13 @@ class IRefitter : public INoCopy
//! Possible reasons for rejection are:
//!
//! * The name of weights is nullptr or does not correspond to any refittable weights.
- //! * The number of weights is inconsistent with the original specification.
+ //! * The count of the weights is inconsistent with the count returned from calling getWeightsPrototype() with the
+ //! same name.
+ //! * The type of the weights is inconsistent with the type returned from calling getWeightsPrototype() with the
+ //! same name.
//!
- //! Modifying the weights before method refitCudaEngine() completes will result in undefined behavior.
+ //! Modifying the weights before method refitCudaEngine or refitCudaEngineAsync returns will result in undefined
+ //! behavior.
//!
//! \warning The string name must be null-terminated, and be at most 4096 bytes including the terminator.
//!
@@ -1214,7 +1855,9 @@ class IRefitter : public INoCopy
//!
//! \brief Set the maximum number of threads.
+ //!
//! \param maxThreads The maximum number of threads that can be used by the refitter.
+ //!
//! \return True if successful, false otherwise.
//!
//! The default value is 1 and includes the current thread.
@@ -1240,6 +1883,145 @@ class IRefitter : public INoCopy
return mImpl->getMaxThreads();
}
+ //!
+ //! \brief Specify new weights on a specified device of given name.
+ //!
+ //! \param name The name of the weights to be refitted.
+ //! \param weights The new weights on the specified device.
+ //! \param location The location (host vs. device) of the new weights.
+ //!
+ //! \return True on success, or false if new weights are rejected.
+ //! Possible reasons for rejection are:
+ //!
+ //! * The name of the weights is nullptr or does not correspond to any refittable weights.
+ //! * The count of the weights is inconsistent with the count returned from calling getWeightsPrototype() with the
+ //! same name.
+ //! * The type of the weights is inconsistent with the type returned from calling getWeightsPrototype() with the
+ //! same name.
+ //!
+ //! It is allowed to provide some weights on CPU and others on GPU.
+ //! Modifying the weights before the method refitCudaEngine() or refitCudaEngineAsync() completes will result in
+ //! undefined behavior.
+ //!
+ //! \warning The string name must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ bool setNamedWeights(char const* name, Weights weights, TensorLocation location) noexcept
+ {
+ return mImpl->setNamedWeightsWithLocation(name, weights, location);
+ }
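+
+ // Sketch: refit with device-resident weights ("fc1.weight", devicePtr, and count are illustrative
+ // placeholders; the device buffer must stay valid until the refit call returns):
+ //
+ //     nvinfer1::Weights w{nvinfer1::DataType::kFLOAT, devicePtr, count};
+ //     refitter->setNamedWeights("fc1.weight", w, nvinfer1::TensorLocation::kDEVICE);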
+
+ //!
+ //! \brief Get weights associated with the given name.
+ //!
+ //! \param weightsName The name of the weights to be refitted.
+ //!
+ //! \return Weights associated with the given name.
+ //!
+ //! If the weights were never set, returns null weights and reports an error to the refitter errorRecorder.
+ //!
+ //! \warning The string weightsName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ Weights getNamedWeights(char const* weightsName) const noexcept
+ {
+ return mImpl->getNamedWeights(weightsName);
+ }
+
+ //!
+ //! \brief Get location for the weights associated with the given name.
+ //!
+ //! \param weightsName The name of the weights to be refitted.
+ //!
+ //! \return Location for the weights associated with the given name.
+ //!
+ //! If the weights were never set, returns TensorLocation::kHOST and reports an error to the refitter errorRecorder.
+ //!
+ //! \warning The string weightsName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ TensorLocation getWeightsLocation(char const* weightsName) const noexcept
+ {
+ return mImpl->getWeightsLocation(weightsName);
+ }
+
+ //!
+ //! \brief Unset weights associated with the given name.
+ //!
+ //! \param weightsName The name of the weights to be refitted.
+ //!
+ //! \return False if the weights were never set, true otherwise.
+ //!
+ //! Unset weights before releasing them.
+ //!
+ //! \warning The string weightsName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ bool unsetNamedWeights(char const* weightsName) noexcept
+ {
+ return mImpl->unsetNamedWeights(weightsName);
+ }
+
+ //!
+ //! \brief Set whether to validate weights during refitting.
+ //!
+ //! \param weightsValidation Indicate whether to validate weights during refitting.
+ //!
+ //! When set to true, TensorRT validates weights during the refit call when converting FP32 weights to FP16/BF16
+ //! or when sparsifying weights. If the provided weights are unsuitable for a transformation, TensorRT issues a
+ //! warning and continues for minor issues (such as overflow during a narrowing conversion), or issues an error
+ //! and stops the refit for severe issues (such as attempting to sparsify dense weights). The flag is true by
+ //! default; set it to false for faster refitting performance.
+ //!
+ void setWeightsValidation(bool weightsValidation) noexcept
+ {
+ return mImpl->setWeightsValidation(weightsValidation);
+ }
+
+ //!
+ //! \brief Get whether to validate weights values during refitting.
+ //!
+ bool getWeightsValidation() const noexcept
+ {
+ return mImpl->getWeightsValidation();
+ }
+
+ //!
+ //! \brief Enqueue weights refitting of the associated engine on the given stream.
+ //!
+ //! \param stream The stream to enqueue the weights updating task.
+ //!
+ //! \return True on success, or false if new weights validation fails or getMissingWeights() != 0 before the call.
+ //! If false is returned, a subset of weights may have been refitted.
+ //!
+ //! The behavior is undefined if the engine has pending enqueued work on a different stream from the provided one.
+ //! Provided weights on CPU can be unset and released, or updated after refitCudaEngineAsync returns.
+ //! Freeing or updating of the provided weights on GPU can be enqueued on the same stream after refitCudaEngineAsync
+ //! returns.
+ //!
+ //! IExecutionContexts associated with the engine remain valid for use afterwards. There is no need to set the same
+ //! weights repeatedly for multiple refit calls as the weights memory can be updated directly instead. The weights
+ //! updating task should use the same stream as the one used for the refit call.
+ //!
+ bool refitCudaEngineAsync(cudaStream_t stream) noexcept
+ {
+ return mImpl->refitCudaEngineAsync(stream);
+ }
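+
+ // Sketch: enqueue the refit and subsequent inference on the same stream (refitter, context, and stream
+ // are assumed to exist; error handling is omitted):
+ //
+ //     if (refitter->refitCudaEngineAsync(stream))
+ //     {
+ //         context->enqueueV3(stream); // work enqueued afterwards on this stream sees the updated weights
+ //     }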
+
+ //!
+ //! \brief Get the Weights prototype associated with the given name.
+ //!
+ //! \param weightsName The name of the weights to be refitted.
+ //!
+ //! \return Weights prototype associated with the given name.
+ //!
+ //! The type and count of the weights prototype are the same as those of the weights used for engine building.
+ //! The values property of a weights prototype is nullptr. The count of the weights prototype is -1 when the name
+ //! of the weights is nullptr or does not correspond to any refittable weights.
+ //!
+ //! \warning The string weightsName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ Weights getWeightsPrototype(char const* weightsName) const noexcept
+ {
+ return mImpl->getWeightsPrototype(weightsName);
+ }
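+
+ // Sketch: size a replacement buffer from the prototype ("fc1.weight" is an illustrative name; the
+ // prototype's values pointer is always nullptr):
+ //
+ //     nvinfer1::Weights proto = refitter->getWeightsPrototype("fc1.weight");
+ //     std::vector<float> newValues(proto.count); // assumes proto.type == DataType::kFLOAT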
+
protected:
apiv::VRefitter* mImpl;
};
@@ -1324,7 +2106,7 @@ class IOptimizationProfile : public INoCopy
//!
//! \warning The string inputName must be null-terminated, and be at most 4096 bytes including the terminator.
//!
- bool setDimensions(char const* inputName, OptProfileSelector select, Dims dims) noexcept
+ bool setDimensions(char const* inputName, OptProfileSelector select, Dims const& dims) noexcept
{
return mImpl->setDimensions(inputName, select, dims);
}
@@ -1345,18 +2127,19 @@ class IOptimizationProfile : public INoCopy
//! \brief Set the minimum / optimum / maximum values for an input shape tensor.
//!
//! This function must be called three times for every input tensor t that is a shape tensor (t.isShape() == true).
- //! This implies that the datatype of t is DataType::kINT32, the rank is either 0 or 1, and the dimensions of t
- //! are fixed at network definition time. This function must not be called for any input tensor that is not a
- //! shape tensor.
+ //! This implies that the dimensions of t are fixed at network definition time and the volume does not exceed 64.
+ //! This function must not be called for any input tensor that is not a shape tensor.
//!
//! Each time this function is called for the same input tensor, the same nbValues must be supplied (either 1
//! if the tensor rank is 0, or dims.d[0] if the rank is 1). Furthermore, if minVals, optVals, maxVals are the
//! minimum, optimum, and maximum values, it must be true that minVals[i] <= optVals[i] <= maxVals[i] for
//! i = 0, ..., nbValues - 1. Execution of the network must be valid for the optVals.
//!
- //! Shape tensors are tensors that contribute to shape calculations in some way, and can contain
- //! any int32_t values appropriate for the network. Shape tensors of other data types (e.g. float) are not
- //! supported. Examples:
+ //! Shape tensors are tensors that contribute to shape calculations in some way. While input shape tensors can be
+ //! type kBOOL, kINT32, or kINT64, the values used to set the minimum, optimum, and maximum values must fit in int32_t.
+ //! Boolean values are represented as 0 for false and 1 for true.
+ //!
+ //! Examples:
//!
//! * A shape tensor used as the second input to IShuffleLayer can contain a -1 wildcard.
//! The corresponding minVal[i] should be -1.
@@ -1372,6 +2155,7 @@ class IOptimizationProfile : public INoCopy
//! \param inputName The input tensor name
//! \param select Whether to set the minimum, optimum, or maximum input values.
//! \param values An array of length nbValues containing the minimum, optimum, or maximum shape tensor elements.
+ //! For multidimensional tensors, the array is in row-major order.
//! \param nbValues The length of the value array, which must equal the number of shape tensor elements (>= 1)
//!
//! \return false if an inconsistency was detected (e.g. nbValues does not match a previous call for the same
@@ -1470,20 +2254,23 @@ class IOptimizationProfile : public INoCopy
//!
//! \brief List of tactic sources for TensorRT.
//!
-//! \see TacticSources, IBuilderConfig::setTacticSources(), IBuilderConfig::getTacticSources(),
-//! PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805
+//! \see TacticSources, IBuilderConfig::setTacticSources(), IBuilderConfig::getTacticSources()
//!
enum class TacticSource : int32_t
{
- //! cuBLAS tactics. Enabled by default.
- //! \note Disabling kCUBLAS will cause the cublas handle passed to plugins in attachToContext to be null.
- kCUBLAS = 0,
- //! cuBLAS LT tactics.
- //! Enabled for x86 platforms and only enabled for non-x86 platforms when CUDA >= 11.0 by default.
- kCUBLAS_LT = 1,
- //! cuDNN tactics. Enabled by default.
+ //! cuBLAS tactics. Disabled by default.
+ //! \note Disabling kCUBLAS will cause the cuBLAS handle passed to plugins in attachToContext to be null.
+ //! \deprecated Deprecated in TensorRT 10.0.
+ kCUBLAS TRT_DEPRECATED_ENUM = 0,
+
+ //! cuBLAS LT tactics. Enabled by default.
+ //! \deprecated Deprecated in TensorRT 9.0.
+ kCUBLAS_LT TRT_DEPRECATED_ENUM = 1,
+
+ //! cuDNN tactics. Disabled by default.
//! \note Disabling kCUDNN will cause the cuDNN handle passed to plugins in attachToContext to be null.
- kCUDNN = 2,
+ //! \deprecated Deprecated in TensorRT 10.0.
+ kCUDNN TRT_DEPRECATED_ENUM = 2,
//! Enables convolution tactics implemented with edge mask tables. These tactics tradeoff memory for performance by
//! consuming additional memory space proportional to the input size.
@@ -1523,11 +2310,6 @@ enum class ProfilingVerbosity : int32_t
kLAYER_NAMES_ONLY = 0, //!< Print only the layer names. This is the default setting.
kNONE = 1, //!< Do not print any layer information.
kDETAILED = 2, //!< Print detailed layer information including layer names and layer parameters.
-
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kLAYER_NAMES_ONLY.
- kDEFAULT TRT_DEPRECATED_ENUM = kLAYER_NAMES_ONLY,
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kDETAILED.
- kVERBOSE TRT_DEPRECATED_ENUM = kDETAILED
};
//! Maximum number of profile verbosity levels in ProfilingVerbosity enum. \see ProfilingVerbosity
@@ -1538,127 +2320,154 @@ constexpr inline int32_t EnumMax() noexcept
}
//!
-//! \class ICudaEngine
+//! \brief Represents one or more SerializationFlag values using binary OR
+//! operations, e.g., 1U << SerializationFlag::kEXCLUDE_LEAN_RUNTIME
//!
-//! \brief An engine for executing inference on a built network, with functionally unsafe features.
+//! \see ISerializationConfig::setFlags(), ISerializationConfig::getFlags()
//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+using SerializationFlags = uint32_t;
+
//!
-class ICudaEngine : public INoCopy
+//! \enum SerializationFlag
+//!
+//! \brief List of valid flags that the engine can enable when serializing the bytes.
+//!
+//! \see ISerializationConfig::setFlags(), ISerializationConfig::getFlags()
+//!
+enum class SerializationFlag : int32_t
+{
+ kEXCLUDE_WEIGHTS = 0, //!< Exclude the weights that can be refitted.
+ kEXCLUDE_LEAN_RUNTIME = 1, //!< Exclude the lean runtime.
+};
+
+//! Maximum number of serialization flags in SerializationFlag enum. \see SerializationFlag
+template <>
+constexpr inline int32_t EnumMax() noexcept
+{
+ return 2;
+}
+
+//!
+//! \class ISerializationConfig
+//!
+//! \brief Holds properties for configuring an engine to serialize the binary.
+//!
+//! \see SerializationFlag
+//!
+class ISerializationConfig : public INoCopy
{
public:
- virtual ~ICudaEngine() noexcept = default;
+ virtual ~ISerializationConfig() noexcept = default;
//!
- //! \brief Get the number of binding indices.
+ //! \brief Set the serialization flags to turn on for this config.
+ //!
+ //! The flags are listed in the SerializationFlag enum.
//!
- //! There are separate binding indices for each optimization profile.
- //! This method returns the total over all profiles.
- //! If the engine has been built for K profiles, the first getNbBindings() / K bindings are used by profile
- //! number 0, the following getNbBindings() / K bindings are used by profile number 1 etc.
+ //! \param serializationFlags The serialization flags for an engine.
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getNbIOTensors.
+ //! \note This function overrides any previously set flags, rather than bitwise ORing in the new flags.
//!
- //! \see getBindingIndex()
+ //! \see getFlags()
//!
- TRT_DEPRECATED int32_t getNbBindings() const noexcept
+ bool setFlags(SerializationFlags serializationFlags) noexcept
{
- return mImpl->getNbBindings();
+ return mImpl->setFlags(serializationFlags);
}
//!
- //! \brief Retrieve the binding index for a named tensor.
- //!
- //! IExecutionContext::enqueueV2() and IExecutionContext::executeV2() require an array of buffers.
- //!
- //! Engine bindings map from tensor names to indices in this array.
- //! Binding indices are assigned at engine build time, and take values in the range [0 ... n-1] where n is the total
- //! number of inputs and outputs.
- //!
- //! To get the binding index of the name in an optimization profile with index k > 0,
- //! mangle the name by appending " [profile k]", as described for method getBindingName().
- //!
- //! \param name The tensor name.
- //! \return The binding index for the named tensor, or -1 if the provided name does not map to an input or output
- //! tensor.
+ //! \brief Get the serialization flags for this config.
//!
- //! \warning The string name must be null-terminated, and be at most 4096 bytes including the terminator.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by name-based methods. Use them instead of binding-index
- //! based methods.
+ //! \return The serialization flags as a bitmask.
//!
- //! \see getNbBindings() getBindingName()
+ //! \see setFlags()
//!
- TRT_DEPRECATED int32_t getBindingIndex(char const* name) const noexcept
+ SerializationFlags getFlags() const noexcept
{
- return mImpl->getBindingIndex(name);
+ return mImpl->getFlags();
}
//!
- //! \brief Retrieve the name corresponding to a binding index.
- //!
- //! This is the reverse mapping to that provided by getBindingIndex().
- //!
- //! For optimization profiles with an index k > 0, the name is mangled by appending
- //! " [profile k]", with k written in decimal. For example, if the tensor in the
- //! INetworkDefinition had the name "foo", and bindingIndex refers to that tensor in the
- //! optimization profile with index 3, getBindingName returns "foo [profile 3]".
+ //! \brief Clear a serialization flag.
//!
- //! \param bindingIndex The binding index.
- //! \return The name corresponding to the index, or nullptr if the index is out of range.
+ //! Clears the specified serialization flag from the config.
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by name-based methods. Use them instead of binding-index
- //! based methods.
+ //! \see setFlags()
//!
- //! \see getBindingIndex()
- //!
- TRT_DEPRECATED char const* getBindingName(int32_t bindingIndex) const noexcept
+ bool clearFlag(SerializationFlag serializationFlag) noexcept
{
- return mImpl->getBindingName(bindingIndex);
+ return mImpl->clearFlag(serializationFlag);
}
//!
- //! \brief Determine whether a binding is an input binding.
- //!
- //! \param bindingIndex The binding index.
- //! \return True if the index corresponds to an input binding and the index is in range.
+ //! \brief Set a serialization flag.
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorIOMode().
+ //! Add the input serialization flag to the already enabled flags.
//!
- //! \see getTensorIOMode()
+ //! \see setFlags()
//!
- TRT_DEPRECATED bool bindingIsInput(int32_t bindingIndex) const noexcept
+ bool setFlag(SerializationFlag serializationFlag) noexcept
{
- return mImpl->bindingIsInput(bindingIndex);
+ return mImpl->setFlag(serializationFlag);
}
//!
- //! \brief Get the dimensions of a binding.
- //!
- //! \param bindingIndex The binding index.
- //! \return The dimensions of the binding if the index is in range, otherwise Dims().
- //! Has -1 for any dimension that varies within the optimization profile.
- //!
- //! For example, suppose an INetworkDefinition has an input with shape [-1,-1]
- //! that becomes a binding b in the engine. If the associated optimization profile
- //! specifies that b has minimum dimensions as [6,9] and maximum dimensions [7,9],
- //! getBindingDimensions(b) returns [-1,9], despite the second dimension being
- //! dynamic in the INetworkDefinition.
+ //! \brief Returns true if the serialization flag is set
//!
- //! Because each optimization profile has separate bindings, the returned value can
- //! differ across profiles. Consider another binding b' for the same network input,
- //! but for another optimization profile. If that other profile specifies minimum
- //! dimensions [5,8] and maximum dimensions [5,9], getBindingDimensions(b') returns [5,-1].
+ //! \see getFlags()
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorShape().
+ //! \return True if flag is set, false if unset.
//!
- //! \see getTensorShape()
- //!
- TRT_DEPRECATED Dims getBindingDimensions(int32_t bindingIndex) const noexcept
+ bool getFlag(SerializationFlag serializationFlag) const noexcept
{
- return mImpl->getBindingDimensions(bindingIndex);
+ return mImpl->getFlag(serializationFlag);
}
+protected:
+ apiv::VSerializationConfig* mImpl;
+};
+
+//!
+//! \enum ExecutionContextAllocationStrategy
+//!
+//! \brief Different memory allocation behaviors for IExecutionContext.
+//!
+//! IExecutionContext requires a block of device memory for internal activation tensors during inference. The user can
+//! either let the execution context manage the memory in various ways or allocate the memory themselves.
+//!
+//! \see ICudaEngine::createExecutionContext()
+//! \see IExecutionContext::setDeviceMemory()
+//!
+enum class ExecutionContextAllocationStrategy : int32_t
+{
+ kSTATIC = 0, //!< Default static allocation with the maximum size across all profiles.
+ kON_PROFILE_CHANGE = 1, //!< Reallocate for a profile when it's selected.
+ kUSER_MANAGED = 2, //!< The user supplies custom allocation to the execution context.
+};
+
+//!
+//! \brief Maximum number of memory allocation strategies in ExecutionContextAllocationStrategy enum.
+//!
+//! \see ExecutionContextAllocationStrategy
+//!
+template <>
+constexpr inline int32_t EnumMax() noexcept
+{
+ return 3;
+}
+
+//!
+//! \class ICudaEngine
+//!
+//! \brief An engine for executing inference on a built network, with functionally unsafe features.
+//!
+//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+//!
+class ICudaEngine : public INoCopy
+{
+public:
+ virtual ~ICudaEngine() noexcept = default;
+
//!
//! \brief Get shape of an input or output tensor.
//!
@@ -1674,21 +2483,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorShape(tensorName);
}
- //!
- //! \brief Determine the required data type for a buffer from its binding index.
- //!
- //! \param bindingIndex The binding index.
- //! \return The type of the data in the buffer.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorDataType().
- //!
- //! \see getTensorDataType()
- //!
- TRT_DEPRECATED DataType getBindingDataType(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingDataType(bindingIndex);
- }
-
//!
//! \brief Determine the required data type for a buffer from its tensor name.
//!
@@ -1704,22 +2498,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorDataType(tensorName);
}
- //!
- //! \brief Get the maximum batch size which can be used for inference. Should only be called if the engine is built
- //! from an INetworkDefinition with implicit batch dimension mode.
- //!
- //! \return The maximum batch size for this engine.
- //!
- //! \warning For an engine built from an INetworkDefinition with explicit batch dimension mode, this will always
- //! return 1.
- //!
- //! \deprecated Deprecated in TensorRT 8.4.
- //!
- TRT_DEPRECATED int32_t getMaxBatchSize() const noexcept
- {
- return mImpl->getMaxBatchSize();
- }
-
//!
//! \brief Get the number of layers in the network.
//!
@@ -1727,72 +2505,43 @@ class ICudaEngine : public INoCopy
//! may be combined or eliminated as the engine is optimized. This value can be useful when building per-layer
//! tables, such as when aggregating profiling data over a number of executions.
//!
- //! \return The number of layers in the network.
- //!
- int32_t getNbLayers() const noexcept
- {
- return mImpl->getNbLayers();
- }
-
- //!
- //! \brief Serialize the network to a stream.
- //!
- //! \return A IHostMemory object that contains the serialized engine.
- //!
- //! The network may be deserialized with IRuntime::deserializeCudaEngine().
- //!
- //! \see IRuntime::deserializeCudaEngine()
- //!
- IHostMemory* serialize() const noexcept
- {
- return mImpl->serialize();
- }
-
- //!
- //! \brief Create an execution context.
- //!
- //! The execution context created will call setOptimizationProfile(0) implicitly if there are
- //! no other execution contexts assigned to optimization profile 0. This functionality is
- //! deprecated in TensorRT 8.6 and will instead default all optimization profiles to 0 starting
- //! in TensorRT 9.0.
- //! If an error recorder has been set for the engine, it will also be passed to the execution context.
- //!
- //! \see IExecutionContext.
- //! \see IExecutionContext::setOptimizationProfile()
+ //! \return The number of layers in the network.
//!
- IExecutionContext* createExecutionContext() noexcept
+ int32_t getNbLayers() const noexcept
{
- return mImpl->createExecutionContext();
+ return mImpl->getNbLayers();
}
//!
- //! \brief Destroy this object;
+ //! \brief Serialize the network to a stream.
+ //!
+ //! \return A IHostMemory object that contains the serialized engine.
//!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
+ //! The network may be deserialized with IRuntime::deserializeCudaEngine().
//!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
+ //! \see IRuntime::deserializeCudaEngine()
//!
- TRT_DEPRECATED void destroy() noexcept
+ IHostMemory* serialize() const noexcept
{
- delete this;
+ return mImpl->serialize();
}
//!
- //! \brief Get location of binding
- //!
- //! This lets you know whether the binding should be a pointer to device or host memory.
- //!
- //! \param bindingIndex The binding index.
- //! \return The location of the bound tensor with given index.
+ //! \brief Create an execution context and specify the strategy for allocating internal activation memory.
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorLocation().
+ //! The default value for the allocation strategy is ExecutionContextAllocationStrategy::kSTATIC, which means the
+ //! context will pre-allocate a block of device memory that is sufficient for all profiles. The newly created
+ //! execution context will be assigned optimization profile 0. If an error recorder has been set for the engine, it
+ //! will also be passed to the execution context.
//!
- //! \see ITensor::setLocation() ITensor::getLocation()
- //! \see getTensorLocation()
+ //! \see IExecutionContext
+ //! \see IExecutionContext::setOptimizationProfileAsync()
+ //! \see ExecutionContextAllocationStrategy
//!
- TRT_DEPRECATED TensorLocation getLocation(int32_t bindingIndex) const noexcept
+ IExecutionContext* createExecutionContext(
+ ExecutionContextAllocationStrategy strategy = ExecutionContextAllocationStrategy::kSTATIC) noexcept
{
- return mImpl->getLocation(bindingIndex);
+ return mImpl->createExecutionContext(strategy);
}
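+
+ // Sketch: let the application own the activation memory (engine is assumed; cudaMalloc error handling
+ // is omitted):
+ //
+ //     auto* ctx = engine->createExecutionContext(nvinfer1::ExecutionContextAllocationStrategy::kUSER_MANAGED);
+ //     void* scratch{nullptr};
+ //     cudaMalloc(&scratch, engine->getDeviceMemorySize());
+ //     ctx->setDeviceMemory(scratch);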
//!
@@ -1846,17 +2595,20 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorIOMode(tensorName);
}
+ //!
//! \brief create an execution context without any device memory allocated
//!
//! The memory for execution of this device context must be supplied by the application.
//!
- IExecutionContext* createExecutionContextWithoutDeviceMemory() noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by createExecutionContext() with parameter.
+ //!
+ TRT_DEPRECATED IExecutionContext* createExecutionContextWithoutDeviceMemory() noexcept
{
return mImpl->createExecutionContextWithoutDeviceMemory();
}
//!
- //! \brief Return the amount of device memory required by an execution context.
+ //! \brief Return the maximum device memory required by the context over all profiles.
//!
//! \see IExecutionContext::setDeviceMemory()
//!
@@ -1866,30 +2618,23 @@ class ICudaEngine : public INoCopy
}
//!
- //! \brief Return true if an engine can be refit.
+ //! \brief Return the maximum device memory required by the context for a profile.
//!
- //! \see nvinfer1::createInferRefitter()
+ //! \see IExecutionContext::setDeviceMemory()
//!
- bool isRefittable() const noexcept
+ size_t getDeviceMemorySizeForProfile(int32_t profileIndex) const noexcept
{
- return mImpl->isRefittable();
+ return mImpl->getDeviceMemorySizeForProfile(profileIndex);
}
//!
- //! \brief Return the number of bytes per component of an element.
- //!
- //! The vector component size is returned if getBindingVectorizedDim() != -1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorBytesPerComponent().
+ //! \brief Return true if an engine can be refit.
//!
- //! \see getBindingVectorizedDim()
- //! \see getTensorBytesPerComponent()
+ //! \see nvinfer1::createInferRefitter()
//!
- TRT_DEPRECATED int32_t getBindingBytesPerComponent(int32_t bindingIndex) const noexcept
+ bool isRefittable() const noexcept
{
- return mImpl->getBindingBytesPerComponent(bindingIndex);
+ return mImpl->isRefittable();
}
//!
@@ -1931,22 +2676,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorBytesPerComponentV2(tensorName, profileIndex);
}
- //!
- //! \brief Return the number of components included in one element.
- //!
- //! The number of elements in the vectors is returned if getBindingVectorizedDim() != -1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorComponentsPerElement().
- //!
- //! \see getBindingVectorizedDim()
- //!
- TRT_DEPRECATED int32_t getBindingComponentsPerElement(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingComponentsPerElement(bindingIndex);
- }
-
//!
//! \brief Return the number of components included in one element, or -1 if the provided name does not map to an
//! input or output tensor.
@@ -1986,20 +2715,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorComponentsPerElementV2(tensorName, profileIndex);
}
- //!
- //! \brief Return the binding format.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorFormat().
- //!
- //! \see getTensorFormat()
- //!
- TRT_DEPRECATED TensorFormat getBindingFormat(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingFormat(bindingIndex);
- }
-
//!
//! \brief Return the tensor format, or TensorFormat::kLINEAR if the provided name does not map to an input or
//! output tensor.
@@ -2029,30 +2744,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorFormatV2(tensorName, profileIndex);
}
- //!
- //! \brief Return the human readable description of the tensor format, or nullptr if the provided name does not
- //! map to an input or output tensor.
- //!
- //! The description includes the order, vectorization, data type, and strides.
- //! Examples are shown as follows:
- //! Example 1: kCHW + FP32
- //! "Row major linear FP32 format"
- //! Example 2: kCHW2 + FP16
- //! "Two wide channel vectorized row major FP16 format"
- //! Example 3: kHWC8 + FP16 + Line Stride = 32
- //! "Channel major FP16 format where C % 8 == 0 and H Stride % 32 == 0"
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorFormatDesc().
- //!
- //! \see getTensorFormatDesc()
- //!
- TRT_DEPRECATED char const* getBindingFormatDesc(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingFormatDesc(bindingIndex);
- }
-
//!
//! \brief Return the human readable description of the tensor format, or empty string if the provided name does not
//! map to an input or output tensor.
@@ -2060,9 +2751,9 @@ class ICudaEngine : public INoCopy
//! The description includes the order, vectorization, data type, and strides.
//! Examples are shown as follows:
//! Example 1: kCHW + FP32
- //! "Row major linear FP32 format"
+ //! "Row-major linear FP32 format"
//! Example 2: kCHW2 + FP16
- //! "Two wide channel vectorized row major FP16 format"
+ //! "Two-wide channel vectorized row-major FP16 format"
//! Example 3: kHWC8 + FP16 + Line Stride = 32
//! "Channel major FP16 format where C % 8 == 0 and H Stride % 32 == 0"
//!
@@ -2084,9 +2775,9 @@ class ICudaEngine : public INoCopy
//! The description includes the order, vectorization, data type, and strides.
//! Examples are shown as follows:
//! Example 1: kCHW + FP32
- //! "Row major linear FP32 format"
+ //! "Row-major linear FP32 format"
//! Example 2: kCHW2 + FP16
- //! "Two wide channel vectorized row major FP16 format"
+ //! "Two-wide channel vectorized row-major FP16 format"
//! Example 3: kHWC8 + FP16 + Line Stride = 32
//! "Channel major FP16 format where C % 8 == 0 and H Stride % 32 == 0"
//!
@@ -2100,22 +2791,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorFormatDescV2(tensorName, profileIndex);
}
- //!
- //! \brief Return the dimension index that the buffer is vectorized, or -1 is the name is not found.
- //!
- //! Specifically -1 is returned if scalars per vector is 1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorVectorizedDim().
- //!
- //! \see getTensorVectorizedDim()
- //!
- TRT_DEPRECATED int32_t getBindingVectorizedDim(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingVectorizedDim(bindingIndex);
- }
-
//!
//! \brief Return the dimension index that the buffer is vectorized, or -1 if the provided name does not
//! map to an input or output tensor.
@@ -2169,45 +2844,12 @@ class ICudaEngine : public INoCopy
//!
//! \return Number of optimization profiles. It is always at least 1.
//!
- //! \see IExecutionContext::setOptimizationProfile()
+ //! \see IExecutionContext::setOptimizationProfileAsync()
int32_t getNbOptimizationProfiles() const noexcept
{
return mImpl->getNbOptimizationProfiles();
}
- //!
- //! \brief Get the minimum / optimum / maximum dimensions for a particular input binding under an optimization
- //! profile.
- //!
- //! \param bindingIndex The input binding index, which must belong to the given profile,
- //! or be between 0 and bindingsPerProfile-1 as described below.
- //!
- //! \param profileIndex The profile index, which must be between 0 and getNbOptimizationProfiles()-1.
- //!
- //! \param select Whether to query the minimum, optimum, or maximum dimensions for this binding.
- //!
- //! \return The minimum / optimum / maximum dimensions for this binding in this profile.
- //! If the profileIndex or bindingIndex are invalid, return Dims with nbDims=-1.
- //!
- //! For backwards compatibility with earlier versions of TensorRT, if the bindingIndex
- //! does not belong to the current optimization profile, but is between 0 and bindingsPerProfile-1,
- //! where bindingsPerProfile = getNbBindings()/getNbOptimizationProfiles,
- //! then a corrected bindingIndex is used instead, computed by:
- //!
- //! profileIndex * bindingsPerProfile + bindingIndex % bindingsPerProfile
- //!
- //! Otherwise the bindingIndex is considered invalid.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getProfileShape().
- //!
- //! \see getProfileShape()
- //!
- TRT_DEPRECATED Dims getProfileDimensions(
- int32_t bindingIndex, int32_t profileIndex, OptProfileSelector select) const noexcept
- {
- return mImpl->getProfileDimensions(bindingIndex, profileIndex, select);
- }
-
//!
//! \brief Get the minimum / optimum / maximum dimensions for an input tensor given its name under an optimization
//! profile.
@@ -2229,88 +2871,25 @@ class ICudaEngine : public INoCopy
}
//!
- //! \brief Get minimum / optimum / maximum values for an input shape binding under an optimization profile.
- //!
- //! \param profileIndex The profile index (must be between 0 and getNbOptimizationProfiles()-1)
- //!
- //! \param inputIndex The input index (must be between 0 and getNbBindings() - 1)
- //!
- //! \param select Whether to query the minimum, optimum, or maximum shape values for this binding.
- //!
- //! \return If the binding is an input shape binding, return a pointer to an array that has
- //! the same number of elements as the corresponding tensor, i.e. 1 if dims.nbDims == 0, or dims.d[0]
- //! if dims.nbDims == 1, where dims = getBindingDimensions(inputIndex). The array contains
- //! the elementwise minimum / optimum / maximum values for this shape binding under the profile.
- //! If either of the indices is out of range, or if the binding is not an input shape binding, return
- //! nullptr.
- //!
- //! For backwards compatibility with earlier versions of TensorRT, a bindingIndex that does not belong
- //! to the profile is corrected as described for getProfileDimensions().
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getShapeValues(). Difference between Execution and shape
- //! tensor is superficial since TensorRT 8.5.
- //!
- //! \see getProfileDimensions() getShapeValues()
- //!
- TRT_DEPRECATED int32_t const* getProfileShapeValues(
- int32_t profileIndex, int32_t inputIndex, OptProfileSelector select) const noexcept
- {
- return mImpl->getProfileShapeValues(profileIndex, inputIndex, select);
- }
-
- //!
- //! \brief True if tensor is required as input for shape calculations or output from them.
- //!
- //! TensorRT evaluates a network in two phases:
- //!
- //! 1. Compute shape information required to determine memory allocation requirements
- //! and validate that runtime sizes make sense.
- //!
- //! 2. Process tensors on the device.
- //!
- //! Some tensors are required in phase 1. These tensors are called "shape tensors", and always
- //! have type Int32 and no more than one dimension. These tensors are not always shapes
- //! themselves, but might be used to calculate tensor shapes for phase 2.
- //!
- //! isShapeBinding(i) returns true if the tensor is a required input or an output computed in phase 1.
- //! isExecutionBinding(i) returns true if the tensor is a required input or an output computed in phase 2.
- //!
- //! For example, if a network uses an input tensor with binding i as an addend
- //! to an IElementWiseLayer that computes the "reshape dimensions" for IShuffleLayer,
- //! then isShapeBinding(i) == true.
+ //! \brief Get the minimum / optimum / maximum values (not dimensions) for an input tensor given
+ //! its name under an optimization profile. These correspond to the values set using
+ //! IOptimizationProfile::setShapeValues when the engine was built.
//!
- //! It's possible to have a tensor be required by both phases. For instance, a tensor
- //! can be used for the "reshape dimensions" and as the indices for an IGatherLayer
- //! collecting floating-point data.
- //!
- //! It's also possible to have a tensor be required by neither phase, but nonetheless
- //! shows up in the engine's inputs. For example, if an input tensor is used only
- //! as an input to IShapeLayer, only its shape matters and its values are irrelevant.
- //!
- //! \deprecated Use name-based isShapeInferenceIO() instead to know whether a tensor is a shape tensor.
- //!
- //! \see isExecutionBinding() isShapeInferenceIO()
- //!
- TRT_DEPRECATED bool isShapeBinding(int32_t bindingIndex) const noexcept
- {
- return mImpl->isShapeBinding(bindingIndex);
- }
-
+ //! \param tensorName The name of an input tensor.
//!
- //! \brief True if pointer to tensor data is required for execution phase, false if nullptr can be supplied.
+ //! \param profileIndex The profile index, which must be between 0 and getNbOptimizationProfiles()-1.
//!
- //! For example, if a network uses an input tensor with binding i ONLY as the "reshape dimensions"
- //! input of IShuffleLayer, then isExecutionBinding(i) is false, and a nullptr can be
- //! supplied for it when calling IExecutionContext::execute or IExecutionContext::enqueue.
+ //! \param select Whether to query the minimum, optimum, or maximum values for this input tensor.
//!
- //! \deprecated No name-based equivalent replacement. Use getTensorLocation() instead to know the location of tensor
- //! data. Distinction between execution binding and shape binding is superficial since TensorRT 8.5.
+ //! \return The minimum / optimum / maximum values for an input tensor in this profile.
+ //! If the profileIndex is invalid or the provided name does not map to an input tensor, return nullptr.
//!
- //! \see isShapeBinding() getTensorLocation()
+ //! \warning The string tensorName must be null-terminated, and be at most 4096 bytes including the terminator.
//!
- TRT_DEPRECATED bool isExecutionBinding(int32_t bindingIndex) const noexcept
+ int32_t const* getProfileTensorValues(char const* tensorName, int32_t profileIndex, OptProfileSelector select) const
+ noexcept
{
- return mImpl->isExecutionBinding(bindingIndex);
+ return mImpl->getProfileTensorValues(tensorName, profileIndex, select);
}
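+
+ // Sketch: read back the optimum values chosen at build time for an input shape tensor ("shape" is an
+ // illustrative tensor name; profile 0 is assumed to exist):
+ //
+ //     int32_t const* opt = engine->getProfileTensorValues("shape", 0, nvinfer1::OptProfileSelector::kOPT);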
//!
@@ -2318,8 +2897,8 @@ class ICudaEngine : public INoCopy
//!
//! If the engine has EngineCapability::kSTANDARD, then all engine functionality is valid.
//! If the engine has EngineCapability::kSAFETY, then only the functionality in safe engine is valid.
- //! If the engine has EngineCapability::kDLA_STANDALONE, then only serialize, destroy, and const-accessor functions are
- //! valid.
+ //! If the engine has EngineCapability::kDLA_STANDALONE, then only serialize, destroy, and const-accessor functions
+ //! are valid.
//!
//! \return The EngineCapability flag that the engine was built for.
//!
@@ -2328,6 +2907,7 @@ class ICudaEngine : public INoCopy
return mImpl->getEngineCapability();
}
+ //!
//! \brief Set the ErrorRecorder for this interface
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
@@ -2338,7 +2918,7 @@ class ICudaEngine : public INoCopy
//! If an error recorder is not set, messages will be sent to the global log stream.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
void setErrorRecorder(IErrorRecorder* recorder) noexcept
@@ -2364,22 +2944,18 @@ class ICudaEngine : public INoCopy
//!
//! \brief Query whether the engine was built with an implicit batch dimension.
//!
- //! \return True if tensors have implicit batch dimension, false otherwise.
- //!
- //! This is an engine-wide property. Either all tensors in the engine
- //! have an implicit batch dimension or none of them do.
- //!
- //! hasImplicitBatchDimension() is true if and only if the INetworkDefinition
- //! from which this engine was built was created with createNetworkV2() without
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
+ //! \return Always false since TensorRT 10.0 does not support an implicit batch dimension.
//!
//! \see createNetworkV2
//!
- bool hasImplicitBatchDimension() const noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch is no longer supported since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool hasImplicitBatchDimension() const noexcept
{
return mImpl->hasImplicitBatchDimension();
}
+ //!
//! \brief return the tactic sources required by this engine.
//!
//! The value returned is equal to zero or more tactics sources set
@@ -2395,6 +2971,7 @@ class ICudaEngine : public INoCopy
return mImpl->getTacticSources();
}
+ //!
//! \brief Return the \ref ProfilingVerbosity the builder config was set to when the engine was built.
//!
//! \return the profiling verbosity the builder config was set to when the engine was built.
@@ -2441,6 +3018,7 @@ class ICudaEngine : public INoCopy
return mImpl->getIOTensorName(index);
}
+ //!
//! \brief Return the hardware compatibility level of this engine.
//!
//! \return hardwareCompatibilityLevel The level of hardware
@@ -2468,36 +3046,166 @@ class ICudaEngine : public INoCopy
return mImpl->getNbAuxStreams();
}
+ //!
+ //! \brief Create a serialization configuration object.
+ //!
+ //! \see ISerializationConfig
+ //!
+ ISerializationConfig* createSerializationConfig() noexcept
+ {
+ return mImpl->createSerializationConfig();
+ }
+
+ //!
+ //! \brief Serialize the network to a stream with the provided SerializationConfig.
+ //!
+ //! \return An IHostMemory object that contains the serialized engine.
+ //!
+ //! The network may be deserialized with IRuntime::deserializeCudaEngine().
+ //!
+ //! \see IRuntime::deserializeCudaEngine()
+ //!
+ IHostMemory* serializeWithConfig(ISerializationConfig& config) const noexcept
+ {
+ return mImpl->serializeWithConfig(config);
+ }
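+
+ // Sketch: serialize a weight-stripped engine (engine is assumed to have been built refittable):
+ //
+ //     nvinfer1::ISerializationConfig* cfg = engine->createSerializationConfig();
+ //     cfg->setFlag(nvinfer1::SerializationFlag::kEXCLUDE_WEIGHTS);
+ //     nvinfer1::IHostMemory* blob = engine->serializeWithConfig(*cfg);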
+
+ //!
+ //! \brief Limit the maximum amount of GPU memory usable for network weights
+ //! in bytes.
+ //!
+ //! \param gpuMemoryBudget This parameter may take on 3 types of values:
+ //! -1: Allows TensorRT to choose the budget according to the streamable weights size.
+ //! Free CUDA memory will be queried at ::createExecutionContext and accordingly:
+ //! * If streamable weights all fit: weight streaming is not required and disabled.
+ //! * Otherwise: Budget is set to getMinimumWeightStreamingBudget
+ //! 0: (default) Disables weight streaming. The execution may fail if the network is too large for GPU memory.
+ //! >0: The maximum bytes of GPU memory that weights can occupy. It must be bounded by
+ //! [getMinimumWeightStreamingBudget, min(getStreamableWeightsSize - 1, free GPU memory)].
+ //!
+ //! By setting a weight limit, users can expect a GPU memory usage reduction
+ //! of |network weights| - gpuMemoryBudget bytes. Maximum memory savings occur
+ //! when gpuMemoryBudget is set to getMinimumWeightStreamingBudget.
+ //!
+ //! Streaming larger amounts of memory will likely result in lower performance
+ //! except in some boundary cases where streaming weights allows the user to
+ //! run larger batch sizes. The higher throughput offsets the increased
+ //! latency in these cases. Tuning the value of the memory limit is
+ //! recommended for best performance.
+ //!
+ //! \warning If weight streaming is active, then multiple concurrent IExecutionContexts will be forced to run serially.
+ //!
+ //! \warning GPU memory for the weights is allocated upon the first IExecutionContext's creation
+ //! and deallocated upon the last one's destruction.
+ //!
+ //! \warning BuilderFlag::kWEIGHT_STREAMING must be set during engine building.
+ //!
+ //! \return True if the memory limit is valid and the call was successful, otherwise false.
+ //!
+ //! \see BuilderFlag::kWEIGHT_STREAMING,
+ //! ICudaEngine::getWeightStreamingBudget
+ //! ICudaEngine::getMinimumWeightStreamingBudget,
+ //! ICudaEngine::getStreamableWeightsSize
+ //!
+ bool setWeightStreamingBudget(int64_t gpuMemoryBudget) noexcept
+ {
+ return mImpl->setWeightStreamingBudget(gpuMemoryBudget);
+ }
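+
+ // Sketch: enable streaming only when the weights do not fit (freeMem would come from cudaMemGetInfo;
+ // the engine is assumed to have been built with BuilderFlag::kWEIGHT_STREAMING):
+ //
+ //     int64_t const total = engine->getStreamableWeightsSize();
+ //     if (total > static_cast<int64_t>(freeMem))
+ //     {
+ //         engine->setWeightStreamingBudget(engine->getMinimumWeightStreamingBudget());
+ //     }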
+
+ //!
+ //! \brief Returns the current weight streaming device memory budget in bytes.
+ //!
+ //! \warning BuilderFlag::kWEIGHT_STREAMING must be set during engine building.
+ //!
+ //! \returns The weight streaming budget in bytes. Please see ::setWeightStreamingBudget for the possible
+ //! values.
+ //!
+ //! \see BuilderFlag::kWEIGHT_STREAMING,
+ //! ICudaEngine::setWeightStreamingBudget,
+ //! ICudaEngine::getMinimumWeightStreamingBudget,
+ //! ICudaEngine::getStreamableWeightsSize
+ //!
+ int64_t getWeightStreamingBudget() const noexcept
+ {
+ return mImpl->getWeightStreamingBudget();
+ }
+
+ //!
+ //! \brief The minimum number of bytes of GPU memory required by network
+ //! weights for successful weight streaming.
+ //!
+ //! This is a positive integer for engines with streamable weights because a
+ //! staging buffer on the GPU is required to temporarily hold the streamed
+ //! weights. The size of the staging buffer is determined by TensorRT and must
+ //! be at least as large as the size of the largest streamable weight in the
+ //! network.
+ //!
+ //! \warning BuilderFlag::kWEIGHT_STREAMING must be set during engine building.
+ //!
+ //!
+ //! \returns The minimum number of bytes of GPU memory required for streaming.
+ //!
+ //! \see ICudaEngine::setWeightStreamingBudget
+ //!
+ int64_t getMinimumWeightStreamingBudget() const noexcept
+ {
+ return mImpl->getMinimumWeightStreamingBudget();
+ }
+
+ //!
+ //! \brief Get the total size in bytes of all streamable weights.
+ //!
+ //! The set of streamable weights is a subset of all network weights. The
+ //! total size may exceed free GPU memory.
+ //!
+ //! Returns 0 if BuilderFlag::kWEIGHT_STREAMING is unset during engine building.
+ //!
+ //!
+ //! \returns The total size in bytes of all streamable weights.
+ //!
+ //! \see ICudaEngine::setWeightStreamingBudget
+ //!
+ int64_t getStreamableWeightsSize() const noexcept
+ {
+ return mImpl->getStreamableWeightsSize();
+ }
+
+ //!
+ //! \brief Check if a tensor is marked as a debug tensor.
+ //!
+ //! Determine whether the given name corresponds to a debug tensor.
+ //!
+ //! \returns True if tensor is a debug tensor, false otherwise.
+ //!
+ //! \see INetworkDefinition::markDebug
+ //!
+ bool isDebugTensor(char const* name) const noexcept
+ {
+ return mImpl->isDebugTensor(name);
+ }
+
protected:
apiv::VCudaEngine* mImpl;
};
-//!
-//! \class IOutputAllocator
-//!
-//! \brief Callback from ExecutionContext::enqueueV3()
-//!
-//! Clients should override the method reallocateOutput.
-//!
-//! \see IExecutionContext::enqueueV3()
-//!
-class IOutputAllocator
+namespace v_1_0
+{
+class IOutputAllocator : public IVersionedInterface
{
public:
//!
- //! \brief Return the API version of this IOutputAllocator.
- //!
- //! Do not override this method as it is used by the TensorRT library to maintain
- //! backwards-compatibility with IOutputAllocator. The value will change if Nvidia
- //! adds additional virtual methods to this class.
+ //! \brief Return version information associated with this interface. Applications must not override this method.
//!
- virtual int32_t getInterfaceVersion() const noexcept
+ InterfaceInfo getInterfaceInfo() const noexcept override
{
- return 1;
+ return {"IOutputAllocator", 1, 0};
}
//!
//! \brief Return a pointer to memory for an output tensor, or nullptr if memory cannot be allocated.
+ //! If the requested memory size exceeds the currentMemory size, the currentMemory can be freed as well.
+ //! If currentMemory is known to be big enough, one option is to return currentMemory.
//!
//! \param tensorName name of the output tensor.
 //! \param currentMemory points to the address set by IExecutionContext::setTensorAddress.
@@ -2506,13 +3214,45 @@ class IOutputAllocator
//!
//! \return A pointer to memory to use for the output tensor or nullptr.
//!
- //! If currentMemory is known to be big enough, one option is to return currentMemory.
- //!
//! To preallocate memory and have the engine fail if the preallocation is not big enough,
//! use IExecutionContext::setTensorAddress to set a pointer to the preallocated memory,
//! and have reallocateOutput return nullptr if that memory is not big enough.
//!
- virtual void* reallocateOutput(char const* tensorName, void* currentMemory, uint64_t size, uint64_t alignment) noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by reallocateOutputAsync with cudaStream_t argument
+ //!
+ TRT_DEPRECATED virtual void* reallocateOutput(
+ char const* tensorName, void* currentMemory, uint64_t size, uint64_t alignment) noexcept
+ {
+ return nullptr;
+ }
+
+ //!
+ //! \brief Return a pointer to memory for an output tensor, or nullptr if memory cannot be allocated.
+ //! If the requested memory size exceeds the currentMemory size, the currentMemory can be freed as well.
+ //! If currentMemory is known to be big enough, one option is to return currentMemory.
+ //!
+ //! \param tensorName name of the output tensor.
+ //! \param currentMemory points to the address set by IExecutionContext::setTensorAddress.
+ //! \param size number of bytes required. Always positive, even for an empty tensor.
+ //! \param alignment required alignment of the allocation.
+ //! \param stream The stream in which to execute the kernels.
+ //!
+ //! \return A pointer to memory to use for the output tensor or nullptr.
+ //!
+ //! To preallocate memory and have the engine fail if the preallocation is not big enough,
+ //! use IExecutionContext::setTensorAddress to set a pointer to the preallocated memory,
+ //! and have reallocateOutputAsync return nullptr if that memory is not big enough.
+ //!
+ //! The default definition exists for the sake of backward compatibility with earlier versions of TensorRT.
+ //! Eventually this method will become a pure virtual method that requires an override, and method
+ //! reallocateOutput() will disappear. Code moving away from TensorRT 9.x should override method
+ //! reallocateOutputAsync() and NOT override method reallocateOutput().
+ //!
+ virtual void* reallocateOutputAsync(
+ char const* tensorName, void* currentMemory, uint64_t size, uint64_t alignment, cudaStream_t /*stream*/)
+ {
+ return reallocateOutput(tensorName, currentMemory, size, alignment);
+ }
//!
//! \brief Called by TensorRT when the shape of the output tensor is known.
@@ -2523,92 +3263,79 @@ class IOutputAllocator
//! \param tensorName name of the tensor
//!
virtual void notifyShape(char const* tensorName, Dims const& dims) noexcept = 0;
-
- virtual ~IOutputAllocator() = default;
};
+} // namespace v_1_0
//!
-//! \class IExecutionContext
+//! \class IOutputAllocator
//!
-//! \brief Context for executing inference using an engine, with functionally unsafe features.
+//! \brief Callback from ExecutionContext::enqueueV3()
//!
-//! Multiple execution contexts may exist for one ICudaEngine instance, allowing the same
-//! engine to be used for the execution of multiple batches simultaneously. If the engine supports
-//! dynamic shapes, each execution context in concurrent use must use a separate optimization profile.
+//! \see IExecutionContext::enqueueV3()
//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-class IExecutionContext : public INoCopy
+using IOutputAllocator = v_1_0::IOutputAllocator;
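
A minimal allocator might look like the sketch below. The class name and growth strategy are illustrative, not part of the API; only reallocateOutputAsync() and notifyShape() are overridden, and the current buffer is assumed to be owned by this allocator (for instance, the tensor address was initially set to nullptr).

```cpp
// Sketch only: an IOutputAllocator that grows a stream-ordered device buffer on demand.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

class GrowingOutputAllocator : public nvinfer1::IOutputAllocator
{
public:
    void* reallocateOutputAsync(char const* /*tensorName*/, void* currentMemory, uint64_t size,
        uint64_t /*alignment*/, cudaStream_t stream) noexcept override
    {
        if (size <= mCapacity)
        {
            return currentMemory; // Existing buffer is already large enough.
        }
        if (currentMemory != nullptr)
        {
            cudaFreeAsync(currentMemory, stream); // Assumes this allocator owns currentMemory.
        }
        void* newMemory{nullptr};
        if (cudaMallocAsync(&newMemory, size, stream) != cudaSuccess)
        {
            mCapacity = 0;
            return nullptr; // Signals allocation failure to TensorRT.
        }
        mCapacity = size;
        return newMemory;
    }

    void notifyShape(char const* /*tensorName*/, nvinfer1::Dims const& dims) noexcept override
    {
        mShape = dims; // The final output shape becomes known here.
    }

    nvinfer1::Dims mShape{};
    uint64_t mCapacity{0};
};
```

An instance is attached per output tensor, e.g. with IExecutionContext::setOutputAllocator(), before calling enqueueV3().
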
+
+namespace v_1_0
+{
+class IDebugListener : public IVersionedInterface
{
public:
- virtual ~IExecutionContext() noexcept = default;
-
- //!
- //! \brief Synchronously execute inference on a batch.
- //!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices
- //! can be queried using ICudaEngine::getBindingIndex()
- //!
- //! \param batchSize The batch size. This is at most the max batch size value supplied to the builder when the
- //! engine was built. If the network is created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag, please use
- //! executeV2() instead, and this batchSize argument has no effect.
- //! \param bindings An array of pointers to input and output buffers for the network.
//!
- //! \return True if execution succeeded.
- //!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by executeV2() if the network is created with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
- //!
- //! \warning This function will trigger layer resource updates if hasImplicitBatchDimension()
- //! returns true and batchSize changes between subsequent calls, possibly resulting
- //! in performance bottlenecks.
- //!
- //! \see ICudaEngine::getBindingIndex() ICudaEngine::getMaxBatchSize()
+ //! \brief Return version information associated with this interface. Applications must not override this method.
//!
- TRT_DEPRECATED bool execute(int32_t batchSize, void* const* bindings) noexcept
+ InterfaceInfo getInterfaceInfo() const noexcept override
{
- return mImpl->execute(batchSize, bindings);
+ return {"IDebugListener", 1, 0};
}
//!
- //! \brief Enqueue inference of a batch on a stream.
- //!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices can be
- //! queried using ICudaEngine::getBindingIndex()
- //!
- //! \param batchSize The batch size. This is at most the max batch size value supplied to the builder when the
- //! engine was built. If the network is created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag, please use
- //! enqueueV3() instead, and this batchSize argument has no effect.
- //! \param bindings An array of pointers to input and output buffers for the network.
- //! \param stream A cuda stream on which the inference kernels will be enqueued.
- //! \param inputConsumed An optional event which will be signaled when the input buffers can be refilled with new
- //! data.
- //!
- //! \return True if the kernels were enqueued successfully.
- //!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by enqueueV2() if the network is created with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
+ //! \brief Callback function that is called when a debug tensor’s value is updated and the debug state of the tensor
+ //! is set to true. Content in the given address is only guaranteed to be valid for the duration of the callback.
//!
- //! \see ICudaEngine::getBindingIndex() ICudaEngine::getMaxBatchSize()
+ //! \param location TensorLocation of the tensor.
+ //! \param addr pointer to buffer.
+ //! \param type data type of the tensor.
+ //! \param shape shape of the tensor.
+ //! \param name name of the tensor.
+ //! \param stream CUDA stream object.
//!
- //! \warning Calling enqueue() in from the same IExecutionContext object with different CUDA streams concurrently
- //! results in undefined behavior. To perform inference concurrently in multiple streams, use one execution
- //! context per stream.
+ //! \return True on success, false otherwise.
//!
- //! \warning This function will trigger layer resource updates if hasImplicitBatchDimension()
- //! returns true and batchSize changes between subsequent calls, possibly resulting in performance
- //! bottlenecks.
- //!
- TRT_DEPRECATED bool enqueue(
- int32_t batchSize, void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept
- {
- return mImpl->enqueue(batchSize, bindings, stream, inputConsumed);
- }
+ virtual bool processDebugTensor(void const* addr, TensorLocation location, DataType type, Dims const& shape,
+ char const* name, cudaStream_t stream)
+ = 0;
+
+ ~IDebugListener() override = default;
+};
+} // namespace v_1_0
+
+//!
+//! \class IDebugListener
+//!
+//! \brief User-implemented callback for notification when value of a debug tensor is updated.
+//!
+using IDebugListener = v_1_0::IDebugListener;
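
As a sketch (not part of the API), a listener that merely logs each update might look as follows; copying out the tensor contents, which are only valid for the duration of the callback, is left as a comment.

```cpp
// Sketch only: an IDebugListener that logs metadata of each updated debug tensor.
#include <cstdio>
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

class LoggingDebugListener : public nvinfer1::IDebugListener
{
public:
    bool processDebugTensor(void const* addr, nvinfer1::TensorLocation location, nvinfer1::DataType type,
        nvinfer1::Dims const& shape, char const* name, cudaStream_t stream) override
    {
        int64_t volume = 1;
        for (int32_t i = 0; i < shape.nbDims; ++i)
        {
            volume *= shape.d[i];
        }
        std::printf("debug tensor '%s': %lld elements, location %d, data type %d\n", name,
            static_cast<long long>(volume), static_cast<int>(location), static_cast<int>(type));
        // The buffer at addr is only valid during this callback; copy it out on
        // `stream` (e.g. with cudaMemcpyAsync) if it must be inspected later.
        static_cast<void>(addr);
        static_cast<void>(stream);
        return true;
    }
};
```
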
+
+//!
+//! \class IExecutionContext
+//!
+//! \brief Context for executing inference using an engine, with functionally unsafe features.
+//!
+//! Multiple execution contexts may exist for one ICudaEngine instance, allowing the same
+//! engine to be used for the execution of multiple batches simultaneously. If the engine supports
+//! dynamic shapes, each execution context in concurrent use must use a separate optimization profile.
+//!
+//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+class IExecutionContext : public INoCopy
+{
+public:
+ virtual ~IExecutionContext() noexcept = default;
//!
//! \brief Set the debug sync flag.
//!
//! If this flag is set to true, the engine will log the successful execution for each kernel during executeV2(). It
- //! has no effect when using enqueueV2()/enqueueV3().
+ //! has no effect when using enqueueV3().
//!
//! \see getDebugSync()
//!
@@ -2657,18 +3384,6 @@ class IExecutionContext : public INoCopy
return mImpl->getEngine();
}
- //!
- //! \brief Destroy this object.
- //!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Set the name of the execution context.
//!
@@ -2697,42 +3412,23 @@ class IExecutionContext : public INoCopy
//! \brief Set the device memory for use by this execution context.
//!
//! The memory must be aligned with cuda memory alignment property (using cudaGetDeviceProperties()), and its size
- //! must be at least that returned by getDeviceMemorySize(). Setting memory to nullptr is acceptable if
- //! getDeviceMemorySize() returns 0. If using enqueueV2()/enqueueV3() to run the network, the memory is in use from
- //! the invocation of enqueueV2()/enqueueV3() until network execution is complete. If using executeV2(), it is in
- //! use until executeV2() returns. Releasing or otherwise using the memory for other purposes during this time will
- //! result in undefined behavior.
- //!
- //! \see ICudaEngine::getDeviceMemorySize() ICudaEngine::createExecutionContextWithoutDeviceMemory()
+ //! must be large enough for performing inference with the given network inputs. getDeviceMemorySize() and
+ //! getDeviceMemorySizeForProfile() report upper bounds of the size. Setting memory to nullptr is acceptable if the
+ //! reported size is 0. If using enqueueV3() to run the network, the memory is in use from the invocation of
+ //! enqueueV3() until network execution is complete. If using executeV2(), it is in use until executeV2() returns.
+ //! Releasing or otherwise using the memory for other purposes during this time will result in undefined behavior.
+ //!
+ //! \see ICudaEngine::getDeviceMemorySize()
+ //! \see ICudaEngine::getDeviceMemorySizeForProfile()
+ //! \see ExecutionContextAllocationStrategy
+ //! \see ICudaEngine::createExecutionContext()
+ //! \see ICudaEngine::createExecutionContextWithoutDeviceMemory()
//!
void setDeviceMemory(void* memory) noexcept
{
mImpl->setDeviceMemory(memory);
}
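
For context, a sketch of the user-managed-memory path referenced above, assuming createExecutionContextWithoutDeviceMemory() and the engine-wide upper bound from getDeviceMemorySize(); deallocation and error reporting are omitted.

```cpp
// Sketch only: back an execution context with application-owned scratch memory.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

nvinfer1::IExecutionContext* createContextWithUserMemory(nvinfer1::ICudaEngine& engine, void*& deviceMemory)
{
    nvinfer1::IExecutionContext* context = engine.createExecutionContextWithoutDeviceMemory();
    if (context == nullptr)
    {
        return nullptr;
    }
    // getDeviceMemorySize() is an upper bound over all profiles and input shapes.
    size_t const size = engine.getDeviceMemorySize();
    if (size > 0 && cudaMalloc(&deviceMemory, size) != cudaSuccess)
    {
        delete context;
        return nullptr;
    }
    // The buffer must remain valid, and unused elsewhere, while this context executes.
    context->setDeviceMemory(size > 0 ? deviceMemory : nullptr);
    return context;
}
```
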
- //!
- //! \brief Return the strides of the buffer for the given binding.
- //!
- //! The strides are in units of elements, not components or bytes.
- //! For example, for TensorFormat::kHWC8, a stride of one spans 8 scalars.
- //!
- //! Note that strides can be different for different execution contexts
- //! with dynamic shapes.
- //!
- //! If the bindingIndex is invalid or there are dynamic dimensions that have not been
- //! set yet, returns Dims with Dims::nbDims = -1.
- //!
- //! \param bindingIndex The binding index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorStrides().
- //!
- //! \see getTensorStrides()
- //!
- TRT_DEPRECATED Dims getStrides(int32_t bindingIndex) const noexcept
- {
- return mImpl->getStrides(bindingIndex);
- }
-
//!
//! \brief Return the strides of the buffer for the given tensor name.
//!
@@ -2755,50 +3451,13 @@ class IExecutionContext : public INoCopy
}
public:
- //!
- //! \brief Select an optimization profile for the current context.
- //!
- //! \param profileIndex Index of the profile. It must lie between 0 and
- //! getEngine().getNbOptimizationProfiles() - 1
- //!
- //! The selected profile will be used in subsequent calls to executeV2()/enqueueV2()/enqueueV3().
- //!
- //! When an optimization profile is switched via this API, TensorRT may
- //! enqueue GPU memory copy operations required to set up the new profile during the subsequent
- //! enqueueV2()/enqueueV3() operations. To avoid these calls during enqueueV2()/enqueueV3(), use
- //! setOptimizationProfileAsync() instead.
- //!
- //! If the associated CUDA engine does not have inputs with dynamic shapes, this method need not be
- //! called, in which case the default profile index of 0 will be used (this is particularly
- //! the case for all safe engines).
- //!
- //! setOptimizationProfile() must be called before calling setBindingDimensions() and
- //! setInputShapeBinding() for all dynamic input tensors or input shape tensors, which in
- //! turn must be called before executeV2()/enqueueV2()/enqueueV3().
- //!
- //! \warning This function will trigger layer resource updates on the next
- //! call of enqueueV2()/enqueueV3()/executeV2(), possibly resulting in performance bottlenecks.
- //!
- //! \return true if the call succeeded, else false (e.g. input out of range)
- //!
- //! \deprecated Superseded by setOptimizationProfileAsync. Deprecated prior to TensorRT 8.0 and will be
- //! removed in 9.0.
- //!
- //! \see ICudaEngine::getNbOptimizationProfiles() IExecutionContext::setOptimizationProfileAsync()
- //!
- TRT_DEPRECATED
- bool setOptimizationProfile(int32_t profileIndex) noexcept
- {
- return mImpl->setOptimizationProfile(profileIndex);
- }
-
//!
//! \brief Get the index of the currently selected optimization profile.
//!
//! If the profile index has not been set yet (implicitly to 0 if no other execution context has been set to
//! profile 0, or explicitly for all subsequent contexts), an invalid value of -1 will be returned
- //! and all calls to enqueueV2()/enqueueV3()/executeV2() will fail until a valid profile index has been set.
- //! This behavior is deprecated in TensorRT 8.6 and in TensorRT 9.0, all profiles will default to optimization
+ //! and all calls to enqueueV3()/executeV2() will fail until a valid profile index has been set.
+ //! This behavior is deprecated in TensorRT 8.6; all profiles will default to optimization
//! profile 0 and -1 will no longer be returned.
//!
int32_t getOptimizationProfile() const noexcept
@@ -2806,45 +3465,6 @@ class IExecutionContext : public INoCopy
return mImpl->getOptimizationProfile();
}
- //!
- //! \brief Set the dynamic dimensions of an input binding.
- //!
- //! \param bindingIndex index of an input tensor whose dimensions must be compatible with
- //! the network definition (i.e. only the wildcard dimension -1 can be replaced with a
- //! new dimension >= 0).
- //!
- //! \param dimensions specifies the dimensions of the input tensor. It must be in the valid
- //! range for the currently selected optimization profile, and the corresponding engine must
- //! not be safety-certified.
- //!
- //! This method requires the engine to be built without an implicit batch dimension.
- //! This method will fail unless a valid optimization profile is defined for the current
- //! execution context (getOptimizationProfile() must not be -1).
- //!
- //! For all dynamic non-output bindings (which have at least one wildcard dimension of -1),
- //! this method needs to be called before either enqueueV2() or executeV2() may be called.
- //! This can be checked using the method allInputDimensionsSpecified().
- //!
- //! \warning This function will trigger layer resource updates on the next
- //! call of enqueueV2()/executeV2(), possibly resulting in performance bottlenecks,
- //! if the dimensions are different than the previous set dimensions.
- //!
- //! \return false if an error occurs (e.g. bindingIndex is out of range for the currently selected
- //! optimization profile or binding dimension is inconsistent with min-max range of the
- //! optimization profile), else true. Note that the network can still be invalid for certain
- //! combinations of input shapes that lead to invalid output shapes. To confirm the correctness
- //! of the network input shapes, check whether the output binding has valid
- //! dimensions using getBindingDimensions() on the output bindingIndex.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by setInputShape().
- //!
- //! \see setInputShape()
- //!
- TRT_DEPRECATED bool setBindingDimensions(int32_t bindingIndex, Dims dimensions) noexcept
- {
- return mImpl->setBindingDimensions(bindingIndex, dimensions);
- }
-
//!
//! \brief Set shape of given input.
//!
@@ -2863,39 +3483,6 @@ class IExecutionContext : public INoCopy
return mImpl->setInputShape(tensorName, dims);
}
- //!
- //! \brief Get the dynamic dimensions of a binding.
- //!
- //! If the engine was built with an implicit batch dimension, same as ICudaEngine::getBindingDimensions.
- //!
- //! If setBindingDimensions() has been called on this binding (or if there are no
- //! dynamic dimensions), all dimensions will be positive. Otherwise, it is necessary to
- //! call setBindingDimensions() before enqueueV2() or executeV2() may be called.
- //!
- //! If the bindingIndex is out of range, an invalid Dims with nbDims == -1 is returned.
- //! The same invalid Dims will be returned if the engine was not built with an implicit
- //! batch dimension and if the execution context is not currently associated with a valid
- //! optimization profile (i.e. if getOptimizationProfile() returns -1).
- //!
- //! If ICudaEngine::bindingIsInput(bindingIndex) is false, then both
- //! allInputDimensionsSpecified() and allInputShapesSpecified() must be true
- //! before calling this method.
- //!
- //! \return Currently selected binding dimensions
- //!
- //! For backwards compatibility with earlier versions of TensorRT, a bindingIndex that does not belong
- //! to the current profile is corrected as described for ICudaEngine::getProfileDimensions.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorShape().
- //!
- //! \see ICudaEngine::getProfileDimensions()
- //! \see getTensorShape()
- //!
- TRT_DEPRECATED Dims getBindingDimensions(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingDimensions(bindingIndex);
- }
-
//!
//! \brief Return the shape of the given input or output.
//!
@@ -2933,78 +3520,17 @@ class IExecutionContext : public INoCopy
return mImpl->getTensorShape(tensorName);
}
- //!
- //! \brief Set values of input tensor required by shape calculations.
- //!
- //! \param bindingIndex index of an input tensor for which
- //! ICudaEngine::isShapeBinding(bindingIndex) and ICudaEngine::bindingIsInput(bindingIndex)
- //! are both true.
- //!
- //! \param data pointer to values of the input tensor. The number of values should be
- //! the product of the dimensions returned by getBindingDimensions(bindingIndex).
- //!
- //! If ICudaEngine::isShapeBinding(bindingIndex) and ICudaEngine::bindingIsInput(bindingIndex)
- //! are both true, this method must be called before enqueueV2() or executeV2() may be called.
- //! This method will fail unless a valid optimization profile is defined for the current
- //! execution context (getOptimizationProfile() must not be -1).
- //!
- //! \warning This function will trigger layer resource updates on the next call of
- //! enqueueV2()/executeV2(), possibly resulting in performance bottlenecks, if the
- //! shapes are different than the previous set shapes.
- //!
- //! \return false if an error occurs (e.g. bindingIndex is out of range for the currently selected
- //! optimization profile or shape data is inconsistent with min-max range of the
- //! optimization profile), else true. Note that the network can still be invalid for certain
- //! combinations of input shapes that lead to invalid output shapes. To confirm the correctness
- //! of the network input shapes, check whether the output binding has valid
- //! dimensions using getBindingDimensions() on the output bindingIndex.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by setInputTensorAddress() or setTensorAddress().
- //!
- //! \see setInputTensorAddress() setTensorAddress()
- //!
- TRT_DEPRECATED bool setInputShapeBinding(int32_t bindingIndex, int32_t const* data) noexcept
- {
- return mImpl->setInputShapeBinding(bindingIndex, data);
- }
-
- //!
- //! \brief Get values of an input tensor required for shape calculations or an output tensor produced by shape
- //! calculations.
- //!
- //! \param bindingIndex index of an input or output tensor for which
- //! ICudaEngine::isShapeBinding(bindingIndex) is true.
- //!
- //! \param data pointer to where values will be written. The number of values written is
- //! the product of the dimensions returned by getBindingDimensions(bindingIndex).
- //!
- //! If ICudaEngine::bindingIsInput(bindingIndex) is false, then both
- //! allInputDimensionsSpecified() and allInputShapesSpecified() must be true
- //! before calling this method. The method will also fail if no valid optimization profile
- //! has been set for the current execution context, i.e. if getOptimizationProfile() returns -1.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorAddress() or getOutputTensorAddress().
- //!
- //! \see isShapeBinding() getTensorAddress() getOutputTensorAddress()
- //!
- TRT_DEPRECATED bool getShapeBinding(int32_t bindingIndex, int32_t* data) const noexcept
- {
- return mImpl->getShapeBinding(bindingIndex, data);
- }
-
//!
//! \brief Whether all dynamic dimensions of input tensors have been specified
//!
//! \return True if all dynamic dimensions of input tensors have been specified
- //! by calling setBindingDimensions().
+ //! by calling setInputShape().
//!
//! Trivially true if network has no dynamically shaped input tensors.
//!
//! Does not work with name-base interfaces eg. IExecutionContext::setInputShape(). Use
//! IExecutionContext::inferShapes() instead.
//!
- //! \see setBindingDimensions(bindingIndex,dimensions)
- //!
bool allInputDimensionsSpecified() const noexcept
{
return mImpl->allInputDimensionsSpecified();
@@ -3020,9 +3546,9 @@ class IExecutionContext : public INoCopy
//! Does not work with name-base interfaces eg. IExecutionContext::setInputShape(). Use
//! IExecutionContext::inferShapes() instead.
//!
- //! \see isShapeBinding(bindingIndex)
+ //! \deprecated Deprecated in TensorRT 10.0. setInputShapeBinding() is removed since TensorRT 10.0.
//!
- bool allInputShapesSpecified() const noexcept
+ TRT_DEPRECATED bool allInputShapesSpecified() const noexcept
{
return mImpl->allInputShapesSpecified();
}
@@ -3038,7 +3564,7 @@ class IExecutionContext : public INoCopy
//! If an error recorder is not set, messages will be sent to the global log stream.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
void setErrorRecorder(IErrorRecorder* recorder) noexcept
@@ -3062,52 +3588,22 @@ class IExecutionContext : public INoCopy
}
//!
- //! \brief Synchronously execute inference a network.
+ //! \brief Synchronously execute a network.
+ //!
+ //! This method requires an array of input and output buffers. The mapping
+ //! from indices to tensor names can be queried using ICudaEngine::getIOTensorName().
//!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices can be
- //! queried using ICudaEngine::getBindingIndex().
- //! This method only works for execution contexts built with full dimension networks.
//! \param bindings An array of pointers to input and output buffers for the network.
//!
//! \return True if execution succeeded.
//!
- //! \see ICudaEngine::getBindingIndex() ICudaEngine::getMaxBatchSize()
+ //! \see ICudaEngine::getIOTensorName()
//!
bool executeV2(void* const* bindings) noexcept
{
return mImpl->executeV2(bindings);
}
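
A sketch of how the bindings array is typically assembled in I/O-tensor index order. ICudaEngine::getNbIOTensors() is assumed from the TensorRT 10 runtime API, and `buffers` is a hypothetical map from tensor names to preallocated device pointers of sufficient size.

```cpp
// Sketch only: build the executeV2() bindings array from tensor names.
#include <map>
#include <string>
#include <vector>
#include "NvInferRuntime.h"

bool runSynchronously(nvinfer1::IExecutionContext& context, std::map<std::string, void*> const& buffers)
{
    nvinfer1::ICudaEngine const& engine = context.getEngine();
    std::vector<void*> bindings(static_cast<size_t>(engine.getNbIOTensors()), nullptr);
    for (int32_t i = 0; i < engine.getNbIOTensors(); ++i)
    {
        // Index i of the bindings array corresponds to the tensor named getIOTensorName(i).
        bindings[static_cast<size_t>(i)] = buffers.at(engine.getIOTensorName(i));
    }
    return context.executeV2(bindings.data());
}
```
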
- //!
- //! \brief Enqueue inference on a stream.
- //!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices can be
- //! queried using ICudaEngine::getBindingIndex().
- //! This method only works for execution contexts built with full dimension networks.
- //! \param bindings An array of pointers to input and output buffers for the network.
- //! \param stream A cuda stream on which the inference kernels will be enqueued
- //! \param inputConsumed An optional event which will be signaled when the input buffers can be refilled with new
- //! data
- //!
- //! \return True if the kernels were enqueued successfully.
- //!
- //! \deprecated Superseded by enqueueV3(). Deprecated in TensorRT 8.5
- //!
- //! \see ICudaEngine::getBindingIndex() ICudaEngine::getMaxBatchSize() IExecutionContext::enqueueV3()
- //!
- //! \note Calling enqueueV2() with a stream in CUDA graph capture mode has a known issue. If dynamic shapes are
- //! used, the first enqueueV2() call after a setInputShapeBinding() call will cause failure in stream capture
- //! due to resource allocation. Please call enqueueV2() once before capturing the graph.
- //!
- //! \warning Calling enqueueV2() in from the same IExecutionContext object with different CUDA streams concurrently
- //! results in undefined behavior. To perform inference concurrently in multiple streams, use one execution
- //! context per stream.
- //!
- TRT_DEPRECATED bool enqueueV2(void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept
- {
- return mImpl->enqueueV2(bindings, stream, inputConsumed);
- }
-
//!
//! \brief Select an optimization profile for the current context with async
//! semantics.
@@ -3123,24 +3619,22 @@ class IExecutionContext : public INoCopy
//! application’s responsibility to guarantee that synchronization between
//! the profile sync stream and the enqueue stream occurs.
//!
- //! The selected profile will be used in subsequent calls to executeV2()/enqueueV2()/enqueueV3().
+ //! The selected profile will be used in subsequent calls to executeV2()/enqueueV3().
//! If the associated CUDA engine has inputs with dynamic shapes, the optimization profile must
- //! be set with its corresponding profileIndex before calling execute or enqueue. If no execution
- //! context is assigned optimization profile 0 and a new context is created for an engine,
- //! setOptimizationProfile(0) is called implicitly. This functionality is deprecated in TensorRT 8.6
- //! and will instead default all optimization profiles to 0 starting in TensorRT 9.0.
+ //! be set with its corresponding profileIndex before calling execute or enqueue. The newly created execution
+ //! context will be assigned optimization profile 0.
//!
//! If the associated CUDA engine does not have inputs with dynamic shapes,
//! this method need not be called, in which case the default profile index
//! of 0 will be used.
//!
//! setOptimizationProfileAsync() must be called before calling
- //! setBindingDimensions() and setInputShapeBinding() for all dynamic input
+ //! setInputShape() for all dynamic input
//! tensors or input shape tensors, which in turn must be called before
- //! executeV2()/enqueueV2()/enqueueV3().
+ //! executeV2()/enqueueV3().
//!
//! \warning This function will trigger layer resource updates on the next call of
- //! enqueueV2()/executeV2()/enqueueV3(), possibly resulting in performance bottlenecks.
+ //! executeV2()/enqueueV3(), possibly resulting in performance bottlenecks.
//!
//! \warning Not synchronizing the stream used at enqueue with the stream
//! used to set optimization profile asynchronously using this API will
@@ -3149,7 +3643,6 @@ class IExecutionContext : public INoCopy
//! \return true if the call succeeded, else false (e.g. input out of range)
//!
//! \see ICudaEngine::getNbOptimizationProfiles()
- //! \see IExecutionContext::setOptimizationProfile()
bool setOptimizationProfileAsync(int32_t profileIndex, cudaStream_t stream) noexcept
{
return mImpl->setOptimizationProfileAsync(profileIndex, stream);
@@ -3165,6 +3658,7 @@ class IExecutionContext : public INoCopy
//!
//! \see IExecutionContext::getEnqueueEmitsProfile()
//! \see IExecutionContext::reportToProfiler()
+ //!
void setEnqueueEmitsProfile(bool enqueueEmitsProfile) noexcept
{
mImpl->setEnqueueEmitsProfile(enqueueEmitsProfile);
@@ -3176,6 +3670,7 @@ class IExecutionContext : public INoCopy
//! \return The enqueueEmitsProfile state.
//!
//! \see IExecutionContext::setEnqueueEmitsProfile()
+ //!
bool getEnqueueEmitsProfile() const noexcept
{
return mImpl->getEnqueueEmitsProfile();
@@ -3205,6 +3700,7 @@ class IExecutionContext : public INoCopy
//!
//! \see IExecutionContext::setEnqueueEmitsProfile()
//! \see IExecutionContext::getEnqueueEmitsProfile()
+ //!
bool reportToProfiler() const noexcept
{
return mImpl->reportToProfiler();
@@ -3228,8 +3724,7 @@ class IExecutionContext : public INoCopy
//! Before calling enqueueV3(), each input must have a non-null address and
//! each output must have a non-null address or an IOutputAllocator to set it later.
//!
- //! If the TensorLocation of the tensor is kHOST, the pointer must point to a host buffer of sufficient size. For
- //! shape tensors, the only supported data type is int32_t.
+ //! If the TensorLocation of the tensor is kHOST, the pointer must point to a host buffer of sufficient size.
//! If the TensorLocation of the tensor is kDEVICE, the pointer must point to a device buffer of sufficient size and
//! alignment, or be nullptr if the tensor is an output tensor that will be allocated by IOutputAllocator.
//!
@@ -3245,7 +3740,7 @@ class IExecutionContext : public INoCopy
//!
//! \warning The string tensorName must be null-terminated, and be at most 4096 bytes including the terminator.
//!
- //! \see setInputTensorAddress() getTensorShape() setOutputAllocator() IOutputAllocator
+ //! \see setInputTensorAddress() setOutputTensorAddress() getTensorShape() setOutputAllocator() IOutputAllocator
//!
bool setTensorAddress(char const* tensorName, void* data) noexcept
{
@@ -3269,6 +3764,29 @@ class IExecutionContext : public INoCopy
return mImpl->getTensorAddress(tensorName);
}
+ //!
+ //! \brief Set the memory address for a given output tensor.
+ //!
+ //! \param tensorName The name of an output tensor.
+ //! \param data The pointer to the buffer to which to write the output.
+ //!
+ //! \return True on success, false if the provided name does not map to an output tensor, does not meet alignment
+ //! requirements, or some other error occurred.
+ //!
+ //! Output addresses can also be set using method setTensorAddress. This method is provided for applications which
+ //! prefer to use different methods for setting input and output tensors.
+ //!
+ //! See setTensorAddress() for alignment and data type constraints.
+ //!
+ //! \warning The string tensorName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ //! \see setTensorAddress()
+ //!
+ bool setOutputTensorAddress(char const* tensorName, void* data) noexcept
+ {
+ return mImpl->setOutputTensorAddress(tensorName, data);
+ }
+
//!
//! \brief Set memory address for given input.
//!
@@ -3343,6 +3861,23 @@ class IExecutionContext : public INoCopy
return mImpl->inferShapes(nbMaxNames, tensorNames);
}
+ //!
+ //! \brief Recompute the internal activation buffer sizes based on the current input shapes, and return the total
+ //! amount of memory required.
+ //!
+ //! Users can allocate the device memory based on the size returned and provide the memory to TRT with
+ //! IExecutionContext::setDeviceMemory(). All input shapes and the optimization profile to use must be specified
+ //! before calling this function; otherwise, the partition will be invalidated.
+ //!
+ //! \return Total amount of memory required on success, 0 if error occurred.
+ //!
+ //! \see IExecutionContext::setDeviceMemory()
+ //!
+ size_t updateDeviceMemorySizeForShapes() noexcept
+ {
+ return mImpl->updateDeviceMemorySizeForShapes();
+ }
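
As a sketch of this flow (shape-specific sizing rather than the engine-wide upper bound), assuming the input shapes have already been set with setInputShape() and that the context does not own its device memory:

```cpp
// Sketch only: size user-managed scratch memory from the current input shapes.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

bool provideActivationMemory(nvinfer1::IExecutionContext& context, void*& deviceMemory)
{
    // All input shapes and the optimization profile must be set before this call.
    size_t const required = context.updateDeviceMemorySizeForShapes();
    if (required == 0)
    {
        return false; // 0 indicates an error, e.g. shapes not fully specified.
    }
    if (cudaMalloc(&deviceMemory, required) != cudaSuccess)
    {
        return false;
    }
    context.setDeviceMemory(deviceMemory);
    return true;
}
```
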
+
//!
//! \brief Mark input as consumed.
//!
@@ -3462,11 +3997,18 @@ class IExecutionContext : public INoCopy
//! Input tensor can be released after the setInputConsumedEvent whereas output tensors require stream
//! synchronization.
//!
+ //! \warning Using the default stream may lead to performance issues due to additional cudaDeviceSynchronize() calls
+ //! by TensorRT to ensure correct synchronization. Please use a non-default stream instead.
+ //!
+ //! \warning If the Engine is streaming weights, enqueueV3 will become synchronous, and
+ //! the graph will not be capturable.
+ //!
bool enqueueV3(cudaStream_t stream) noexcept
{
return mImpl->enqueueV3(stream);
}
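
Putting the pieces together, a typical enqueueV3() flow might look like the sketch below. The tensor names "input" and "output" and the 1x3x224x224 shape are illustrative; dIn and dOut are assumed to be device buffers of sufficient size.

```cpp
// Sketch only: name-based I/O setup followed by asynchronous execution.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

bool infer(nvinfer1::IExecutionContext& context, void* dIn, void* dOut, cudaStream_t stream)
{
    // For dynamic shapes, the input shape must be set before execution.
    nvinfer1::Dims inputDims{};
    inputDims.nbDims = 4;
    inputDims.d[0] = 1;
    inputDims.d[1] = 3;
    inputDims.d[2] = 224;
    inputDims.d[3] = 224;
    if (!context.setInputShape("input", inputDims))
    {
        return false;
    }
    if (!context.setTensorAddress("input", dIn) || !context.setTensorAddress("output", dOut))
    {
        return false;
    }
    if (!context.enqueueV3(stream))
    {
        return false;
    }
    // Outputs are valid only after the stream has been synchronized (or an event observed).
    return cudaStreamSynchronize(stream) == cudaSuccess;
}
```
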
+ //!
//! \brief Set the maximum size for persistent cache usage.
//!
//! This function sets the maximum persistent L2 cache that this execution context may use for activation caching.
@@ -3496,7 +4038,7 @@ class IExecutionContext : public INoCopy
//!
//! \brief Set the verbosity of the NVTX markers in the execution context.
//!
- //! Building with kDETAILED verbosity will generally increase latency in enqueueV2/enqueueV3(). Call this method
+ //! Building with kDETAILED verbosity will generally increase latency in enqueueV3(). Call this method
//! to select NVTX verbosity in this execution context at runtime.
//!
//! The default is the verbosity with which the engine was built, and the verbosity may not be raised above that
@@ -3560,6 +4102,70 @@ class IExecutionContext : public INoCopy
mImpl->setAuxStreams(auxStreams, nbStreams);
}
+ //!
+ //! \brief Set DebugListener for this execution context.
+ //!
+ //! \param listener DebugListener for this execution context.
+ //!
+ //! \return True if successful, false otherwise.
+ //!
+ bool setDebugListener(IDebugListener* listener) noexcept
+ {
+ return mImpl->setDebugListener(listener);
+ }
+
+ //!
+ //! \brief Get the DebugListener of this execution context.
+ //!
+ //! \return DebugListener of this execution context.
+ //!
+ IDebugListener* getDebugListener() noexcept
+ {
+ return mImpl->getDebugListener();
+ }
+
+ //!
+ //! \brief Set debug state of tensor given the tensor name.
+ //!
+ //! Turn the debug state of a tensor on or off.
+ //! A tensor with the given name must exist in the network, and the tensor must have
+ //! been marked as a debug tensor during build time. Otherwise, an error is thrown.
+ //!
+ //! \param name Name of target tensor.
+ //!
+ //! \param flag True if turning on the debug state, false if turning off the debug state of the tensor.
+ //!        The default is off.
+ //!
+ //! \return True if successful, false otherwise.
+ //!
+ bool setTensorDebugState(char const* name, bool flag) noexcept
+ {
+ return mImpl->setTensorDebugState(name, flag);
+ }
+
+ //!
+ //! \brief Turn the debug state of all debug tensors on or off.
+ //!
+ //! \param flag True if turning on the debug state, false if turning off the debug state. The default is off.
+ //!
+ //! \return True if successful, false otherwise.
+ bool setAllTensorsDebugState(bool flag) noexcept
+ {
+ return mImpl->setAllTensorsDebugState(flag);
+ }
+
+ //!
+ //! \brief Get the debug state of a tensor given its name.
+ //!
+ //! \param name Name of the tensor.
+ //!
+ //! \return True if there is a debug tensor with the given name and its debug state is turned on.
+ //!
+ bool getDebugState(char const* name) const noexcept
+ {
+ return mImpl->getDebugState(name);
+ }
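
A sketch tying the debug-tensor APIs together: the tensor name "attention_scores" is hypothetical and must have been marked with INetworkDefinition::markDebug() at build time, and `listener` is any IDebugListener implementation, such as the LoggingDebugListener sketched earlier.

```cpp
// Sketch only: attach a debug listener and enable selected debug tensors.
#include "NvInferRuntime.h"

bool enableDebugging(nvinfer1::IExecutionContext& context, nvinfer1::IDebugListener& listener)
{
    if (!context.setDebugListener(&listener))
    {
        return false;
    }
    // Debug states default to off; turn on a single tensor if it was marked at build time.
    if (context.getEngine().isDebugTensor("attention_scores"))
    {
        return context.setTensorDebugState("attention_scores", true);
    }
    // Otherwise enable every tensor that was marked as a debug tensor.
    return context.setAllTensorsDebugState(true);
}
```
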
+
protected:
apiv::VExecutionContext* mImpl;
}; // class IExecutionContext
@@ -3693,7 +4299,7 @@ class IEngineInspector : public INoCopy
//! If an error recorder is not set, messages will be sent to the global log stream.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
void setErrorRecorder(IErrorRecorder* recorder) noexcept
@@ -3829,6 +4435,169 @@ class ILoggerFinder
virtual ~ILoggerFinder() = default;
};
+//! DO NOT REFER TO namespace v_1_0 IN CODE. ALWAYS USE nvinfer1 INSTEAD.
+//! The name v_1_0 may change in future versions of TensorRT.
+namespace v_1_0
+{
+
+class IGpuAsyncAllocator : public IGpuAllocator
+{
+public:
+ IGpuAsyncAllocator() = default;
+ ~IGpuAsyncAllocator() override = default;
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle stream-ordered asynchronous
+ //! acquisition of GPU memory.
+ //!
+ //! \param size The size of the memory block required (in bytes).
+ //! \param alignment The required alignment of memory. Alignment will be zero
+ //! or a power of 2 not exceeding the alignment guaranteed by cudaMalloc.
+ //! Thus this allocator can be safely implemented with cudaMalloc/cudaFree.
+ //! An alignment value of zero indicates any alignment is acceptable.
+ //! \param flags Reserved for future use. In the current release, 0 will be passed.
+ //!
+ //! \param stream Specifies the CUDA stream for the asynchronous allocation. If nullptr or 0 is
+ //! passed, the default stream will be used.
+ //!
+ //! \return If the allocation was successful, the start address of a device memory block of the requested size.
+ //! If an allocation request of size 0 is made, nullptr must be returned.
+ //! If an allocation request cannot be satisfied, nullptr must be returned.
+ //! If a non-null address is returned, it is guaranteed to have the specified alignment.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocateAsync/deallocateAsync
+ //! requests.
+ //!
+ //! \note The implementation is not required to be asynchronous. It is permitted to synchronize,
+ //! albeit doing so will lose the performance advantage of asynchronous allocation.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //!
+ void* allocateAsync(uint64_t const size, uint64_t const alignment, AllocatorFlags const flags,
+ cudaStream_t /*stream*/) noexcept override = 0;
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle stream-ordered asynchronous
+ //! release of GPU memory.
+ //!
+ //! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
+ //!
+ //! \param memory A memory address that was previously returned by an allocate() or reallocate() call of the same
+ //! allocator object.
+ //!
+ //! \param stream Specifies the CUDA stream for the asynchronous deallocation. If nullptr or 0 is
+ //! passed, the default stream will be used.
+ //!
+ //! \return True if the acquired memory is released successfully.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocateAsync/deallocateAsync
+ //! requests.
+ //!
+ //! \note The implementation is not required to be asynchronous. It is permitted to synchronize,
+ //! albeit doing so will lose the performance advantage of asynchronous deallocation.
+ //! Either way, it is critical that it not actually free the memory until the current
+ //! stream position is reached.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ bool deallocateAsync(void* const memory, cudaStream_t /*stream*/) noexcept override = 0;
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle acquisition of GPU memory.
+ //!
+ //! \param size The size of the memory block required (in bytes).
+ //! \param alignment The required alignment of memory. Alignment will be zero
+ //! or a power of 2 not exceeding the alignment guaranteed by cudaMalloc.
+ //! Thus this allocator can be safely implemented with cudaMalloc/cudaFree.
+ //! An alignment value of zero indicates any alignment is acceptable.
+ //! \param flags Reserved for future use. In the current release, 0 will be passed.
+ //!
+ //! \return If the allocation was successful, the start address of a device memory block of the requested size.
+ //! If an allocation request of size 0 is made, nullptr must be returned.
+ //! If an allocation request cannot be satisfied, nullptr must be returned.
+ //! If a non-null address is returned, it is guaranteed to have the specified alignment.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocateAsync/deallocateAsync/reallocate requests.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by allocateAsync
+ //!
+ TRT_DEPRECATED void* allocate(
+ uint64_t const size, uint64_t const alignment, AllocatorFlags const flags) noexcept override
+ {
+ return allocateAsync(size, alignment, flags, nullptr);
+ }
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle release of GPU memory.
+ //!
+ //! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
+ //!
+ //! \param memory A memory address that was previously returned by an allocate() or reallocate() call of the same
+ //! allocator object.
+ //!
+ //! \return True if the acquired memory is released successfully.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
+ //! requests.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by deallocateAsync
+ //!
+ TRT_DEPRECATED bool deallocate(void* const memory) noexcept override
+ {
+ return deallocateAsync(memory, nullptr);
+ }
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return {"IGpuAllocator", 1, 0};
+ }
+};
+} // namespace v_1_0
+
+//!
+//! \class IGpuAsyncAllocator
+//!
+//! \brief Application-implemented class for controlling asynchronous (stream ordered) memory allocation on the GPU.
+//!
+//! \warning The lifetime of an IGpuAsyncAllocator object must exceed that of all objects that use it.
+//!
+//! The advantage of deriving from IGpuAsyncAllocator instead of IGpuAllocator is that you only have
+//! to override two methods: allocateAsync() and deallocateAsync() to implement an allocator with
+//! asynchronous capability, whereas deriving from IGpuAllocator requires overriding four methods,
+//! including two deprecated methods.
+//!
+//! \see IGpuAllocator
+using IGpuAsyncAllocator = v_1_0::IGpuAsyncAllocator;
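
A minimal derivation might look like the following sketch, built on cudaMallocAsync/cudaFreeAsync. Only the two pure-virtual methods are overridden; the deprecated synchronous entry points fall back to them automatically.

```cpp
// Sketch only: a stream-ordered allocator derived from IGpuAsyncAllocator.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

class StreamOrderedAllocator : public nvinfer1::IGpuAsyncAllocator
{
public:
    void* allocateAsync(uint64_t const size, uint64_t const /*alignment*/,
        nvinfer1::AllocatorFlags const /*flags*/, cudaStream_t stream) noexcept override
    {
        if (size == 0)
        {
            return nullptr; // Size-0 requests must return nullptr.
        }
        void* memory{nullptr};
        // cudaMallocAsync returns memory suitably aligned for any device allocation.
        return (cudaMallocAsync(&memory, size, stream) == cudaSuccess) ? memory : nullptr;
    }

    bool deallocateAsync(void* const memory, cudaStream_t stream) noexcept override
    {
        if (memory == nullptr)
        {
            return true; // TensorRT may pass nullptr; treat it as a successful no-op.
        }
        return cudaFreeAsync(memory, stream) == cudaSuccess;
    }
};
```

An instance would typically be registered before building or deserialization, for example via IRuntime::setGpuAllocator().
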
+
} // namespace nvinfer1
+//!
+//! \brief Return the library major version number.
+//!
+extern "C" TENSORRTAPI int32_t getInferLibMajorVersion() noexcept;
+//!
+//! \brief Return the library minor version number.
+//!
+extern "C" TENSORRTAPI int32_t getInferLibMinorVersion() noexcept;
+//!
+//! \brief Return the library patch version number.
+//!
+extern "C" TENSORRTAPI int32_t getInferLibPatchVersion() noexcept;
+//!
+//! \brief Return the library build version number.
+//!
+extern "C" TENSORRTAPI int32_t getInferLibBuildVersion() noexcept;
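
For illustration, these runtime queries can be combined with the header-side version macros (NV_TENSORRT_MAJOR/MINOR/PATCH, assumed visible via NvInferVersion.h) to verify that the loaded library matches the headers the application was compiled against:

```cpp
// Sketch only: compare compile-time and run-time TensorRT versions.
#include <cstdio>
#include "NvInferRuntime.h"

void reportTensorRTVersion()
{
    std::printf("compiled against TensorRT %d.%d.%d, loaded library is %d.%d.%d (build %d)\n",
        NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH,
        getInferLibMajorVersion(), getInferLibMinorVersion(), getInferLibPatchVersion(),
        getInferLibBuildVersion());
}
```
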
+
#endif // NV_INFER_RUNTIME_H
diff --git a/include/NvInferRuntimeBase.h b/include/NvInferRuntimeBase.h
index 2701240d..60006e6c 100644
--- a/include/NvInferRuntimeBase.h
+++ b/include/NvInferRuntimeBase.h
@@ -68,17 +68,24 @@
//! NvInferSafeRuntime.h (for the safety runtime).
//!
-// forward declare some CUDA types to avoid an include dependency
+//! Forward declare some CUDA types to avoid an include dependency.
extern "C"
{
- //! Forward declaration of cublasContext to use in other interfaces
+ //! Forward declaration of cublasContext to use in other interfaces.
struct cublasContext;
- //! Forward declaration of cudnnContext to use in other interfaces
+ //! Forward declaration of cudnnContext to use in other interfaces.
struct cudnnContext;
}
-#define NV_TENSORRT_VERSION nvinfer1::kNV_TENSORRT_VERSION_IMPL
+//! Construct a single integer denoting TensorRT version.
+//! Usable in preprocessor expressions.
+#define NV_TENSORRT_VERSION_INT(major, minor, patch) ((major) *10000L + (minor) *100L + (patch) *1L)
+
+//! TensorRT version as a single integer.
+//! Usable in preprocessor expressions.
+#define NV_TENSORRT_VERSION NV_TENSORRT_VERSION_INT(NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH)
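
Because both macros are usable in preprocessor expressions, version-dependent code can be guarded at compile time, as in this sketch:

```cpp
// Sketch only: compile-time guard on the TensorRT version.
#include "NvInferRuntimeBase.h"

#if NV_TENSORRT_VERSION >= NV_TENSORRT_VERSION_INT(10, 0, 0)
// Paths that rely on TensorRT 10.x behavior (64-bit Dims, versioned interfaces, ...).
#else
// Fallback paths for older releases.
#endif
```
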
+
//!
//! \namespace nvinfer1
//!
@@ -86,22 +93,19 @@ extern "C"
//!
namespace nvinfer1
{
-
-static constexpr int32_t kNV_TENSORRT_VERSION_IMPL
- = (NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + NV_TENSORRT_PATCH; // major, minor, patch
-
//! char_t is the type used by TensorRT to represent all valid characters.
using char_t = char;
//! AsciiChar is the type used by TensorRT to represent valid ASCII characters.
-//! This type is used by IPluginV2, PluginField, IPluginCreator, IPluginRegistry, and
-//! ILogger due to their use in automotive safety context.
+//! This type is widely used in automotive safety context.
using AsciiChar = char_t;
//! Forward declare IErrorRecorder for use in other interfaces.
+namespace v_1_0
+{
class IErrorRecorder;
-//! Forward declare IGpuAllocator for use in other interfaces.
-class IGpuAllocator;
+}
+using IErrorRecorder = v_1_0::IErrorRecorder;
namespace impl
{
@@ -126,7 +130,7 @@ enum class DataType : int32_t
//! 32-bit floating point format.
kFLOAT = 0,
- //! IEEE 16-bit floating-point format.
+ //! IEEE 16-bit floating-point format -- has a 5 bit exponent and 11 bit significand.
kHALF = 1,
//! Signed 8-bit integer representing a quantized floating-point value.
@@ -148,15 +152,22 @@ enum class DataType : int32_t
//! to equivalent floating point values.
//! {kFLOAT, kHALF} to kUINT8 conversion will convert the floating point values
//! to integer values by truncating towards zero. This conversion has undefined behavior for
- //! floating point values outside the range [0.0f, 256.0f) after truncation.
+ //! floating point values outside the range [0.0F, 256.0F) after truncation.
//! kUINT8 conversions are not supported for {kINT8, kINT32, kBOOL}.
kUINT8 = 5,
//! Signed 8-bit floating point with
//! 1 sign bit, 4 exponent bits, 3 mantissa bits, and exponent-bias 7.
- //! \warning kFP8 is not supported yet and will result in an error or undefined behavior.
- kFP8 = 6
+ kFP8 = 6,
+
+ //! Brain float -- has an 8 bit exponent and 8 bit significand.
+ kBF16 = 7,
+ //! Signed 64-bit integer type.
+ kINT64 = 8,
+
+ //! Signed 4-bit integer type.
+ kINT4 = 9,
};
namespace impl
@@ -165,8 +176,8 @@ namespace impl
template <>
struct EnumMaxImpl
{
- // Declaration of kVALUE that represents maximum number of elements in DataType enum
- static constexpr int32_t kVALUE = 7;
+ //! Declaration of kVALUE that represents the maximum number of elements in the DataType enum.
+ static constexpr int32_t kVALUE = 10;
};
} // namespace impl
@@ -174,29 +185,29 @@ struct EnumMaxImpl
//! \class Dims
//! \brief Structure to define the dimensions of a tensor.
//!
-//! TensorRT can also return an invalid dims structure. This structure is represented by nbDims == -1
-//! and d[i] == 0 for all d.
+//! TensorRT can also return an "invalid dims" structure. This structure is
+//! represented by nbDims == -1 and d[i] == 0 for all i.
//!
-//! TensorRT can also return an "unknown rank" dims structure. This structure is represented by nbDims == -1
-//! and d[i] == -1 for all d.
+//! TensorRT can also return an "unknown rank" dims structure. This structure is
+//! represented by nbDims == -1 and d[i] == -1 for all i.
//!
-class Dims32
+class Dims64
{
public:
//! The maximum rank (number of dimensions) supported for a tensor.
static constexpr int32_t MAX_DIMS{8};
+
//! The rank (number of dimensions).
int32_t nbDims;
+
//! The extent of each dimension.
- int32_t d[MAX_DIMS];
+ int64_t d[MAX_DIMS];
};
//!
-//! Alias for Dims32.
-//!
-//! \warning: This alias might change in the future.
+//! Alias for Dims64.
//!
-using Dims = Dims32;
+using Dims = Dims64;
//!
//! \enum TensorFormat
@@ -207,94 +218,95 @@ using Dims = Dims32;
//!
//! \see IPluginV2::supportsFormat(), safe::ICudaEngine::getBindingFormat()
//!
+//! Many of the formats are **vector-major** or **vector-minor**. These formats specify
+//! a vector dimension and scalars per vector.
+//! For example, suppose that the tensor has dimensions [M,N,C,H,W],
+//! the vector dimension is C and there are V scalars per vector.
+//!
+//! * A **vector-major** format splits the vectorized dimension into two axes in the
+//! memory layout. The vectorized dimension is replaced by an axis of length ceil(C/V)
+//! and a new dimension of length V is appended. For the example tensor, the memory layout
+//! is equivalent to an array with dimensions [M][N][ceil(C/V)][H][W][V].
+//! Tensor coordinate (m,n,c,h,w) maps to array location [m][n][c/V][h][w][c\%V].
+//!
+//! * A **vector-minor** format moves the vectorized dimension to become the last axis
+//! in the memory layout. For the example tensor, the memory layout is equivalent to an
+//! array with dimensions [M][N][H][W][ceil(C/V)*V]. Tensor coordinate (m,n,c,h,w) maps
+//! to array location subscript [m][n][h][w][c].
+//!
+//! In interfaces that refer to "components per element", that's the value of V above. A sketch of the
+//! vector-major offset computation appears after this enum.
+//!
//! For more information about data formats, see the topic "Data Format Description" located in the
-//! TensorRT Developer Guide.
+//! TensorRT Developer Guide. https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#data-format-desc
//!
enum class TensorFormat : int32_t
{
- //! Row major linear format.
- //! For a tensor with dimensions {N, C, H, W} or {numbers, channels,
- //! columns, rows}, the dimensional index corresponds to {3, 2, 1, 0}
- //! and thus the order is W minor.
+ //! Memory layout is similar to an array in C or C++.
+ //! The stride of each dimension is the product of the dimensions after it.
+ //! The last dimension has unit stride.
//!
//! For DLA usage, the tensor sizes are limited to C,H,W in the range [1,8192].
- //!
kLINEAR = 0,
- //! Two wide channel vectorized row major format. This format is bound to
- //! FP16. It is only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+1)/2][H][W][2], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][c/2][h][w][c%2].
+ //! Vector-major format with two scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires FP16 or BF16 and at least three dimensions.
kCHW2 = 1,
- //! Eight channel format where C is padded to a multiple of 8. This format
- //! is bound to FP16. It is only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to the array with dimensions
- //! [N][H][W][(C+7)/8*8], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][h][w][c].
+ //! Vector-minor format with eight scalars per vector.
+ //! Vector dimension is third to last.
+ //! This format requires FP16 or BF16 and at least three dimensions.
kHWC8 = 2,
- //! Four wide channel vectorized row major format. This format is bound to
- //! INT8 or FP16. It is only available for dimensions >= 3.
- //! For INT8, the C dimension must be a build-time constant.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+3)/4][H][W][4], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][c/4][h][w][c%4].
+ //! Vector-major format with four scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires INT8 or FP16 and at least three dimensions.
+ //! For INT8, the length of the vector dimension must be a build-time constant.
//!
//! Deprecated usage:
//!
//! If running on the DLA, this format can be used for acceleration
- //! with the caveat that C must be equal or lesser than 4.
+ //! with the caveat that C must be less than or equal to 4.
//! If used as DLA input and the build option kGPU_FALLBACK is not specified,
- //! it needs to meet line stride requirement of DLA format. Column stride in bytes should
- //! be a multiple of 32 on Xavier and 64 on Orin.
+ //! it needs to meet line stride requirement of DLA format. Column stride in
+ //! bytes must be a multiple of 64 on Orin.
kCHW4 = 3,
- //! Sixteen wide channel vectorized row major format. This format is bound
- //! to FP16. It is only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+15)/16][H][W][16], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][c/16][h][w][c%16].
+ //! Vector-major format with 16 scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires INT8 or FP16 and at least three dimensions.
//!
//! For DLA usage, this format maps to the native feature format for FP16,
//! and the tensor sizes are limited to C,H,W in the range [1,8192].
- //!
kCHW16 = 4,
- //! Thirty-two wide channel vectorized row major format. This format is
- //! only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+31)/32][H][W][32], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][c/32][h][w][c%32].
+ //! Vector-major format with 32 scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires at least three dimensions.
//!
//! For DLA usage, this format maps to the native feature format for INT8,
//! and the tensor sizes are limited to C,H,W in the range [1,8192].
kCHW32 = 5,
- //! Eight channel format where C is padded to a multiple of 8. This format
- //! is bound to FP16, and it is only available for dimensions >= 4.
- //! For a tensor with dimensions {N, C, D, H, W},
- //! the memory layout is equivalent to an array with dimensions
- //! [N][D][H][W][(C+7)/8*8], with the tensor coordinates (n, c, d, h, w)
- //! mapping to array subscript [n][d][h][w][c].
+ //! Vector-minor format with eight scalars per vector.
+ //! Vector dimension is fourth to last.
+ //!
+ //! This format requires FP16 or BF16 and at least four dimensions.
kDHWC8 = 6,
- //! Thirty-two wide channel vectorized row major format. This format is
- //! bound to FP16 and INT8 and is only available for dimensions >= 4.
- //! For a tensor with dimensions {N, C, D, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+31)/32][D][H][W][32], with the tensor coordinates (n, c, d, h, w)
- //! mapping to array subscript [n][c/32][d][h][w][c%32].
+ //! Vector-major format with 32 scalars per vector.
+ //! Vector dimension is fourth to last.
+ //!
+ //! This format requires FP16 or INT8 and at least four dimensions.
kCDHW32 = 7,
- //! Non-vectorized channel-last format. This format is bound to either FP32 or UINT8,
- //! and is only available for dimensions >= 3.
+ //! Vector-minor format where channel dimension is third to last and unpadded.
+ //!
+ //! This format requires either FP32 or UINT8 and at least three dimensions.
kHWC = 8,
//! DLA planar format. For a tensor with dimension {N, C, H, W}, the W axis
@@ -309,46 +321,123 @@ enum class TensorFormat : int32_t
//! DLA image format. For a tensor with dimension {N, C, H, W} the C axis
//! always has unit stride. The stride for stepping along the H axis is rounded up
- //! to 32 bytes on Xavier and 64 bytes on Orin. C can only be 1, 3 or 4.
+ //! to 64 bytes on Orin. C can only be 1, 3 or 4.
//! If C == 1, it will map to grayscale format.
//! If C == 3 or C == 4, it will map to color image format. And if C == 3,
//! the stride for stepping along the W axis needs to be padded to 4 in elements.
//!
//! When C is {1, 3, 4}, then C' is {1, 4, 4} respectively,
//! the memory layout is equivalent to a C array with dimensions
- //! [N][H][roundUp(W, 32/C'/elementSize)][C'] on Xavier and [N][H][roundUp(W, 64/C'/elementSize)][C'] on Orin
+ //! [N][H][roundUp(W, 64/C'/elementSize)][C'] on Orin
//! where elementSize is 2 for FP16
//! and 1 for Int8. The tensor coordinates (n, c, h, w) mapping to array
//! subscript [n][h][w][c].
kDLA_HWC4 = 10,
- //! Sixteen channel format where C is padded to a multiple of 16. This format
- //! is bound to FP16. It is only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to the array with dimensions
- //! [N][H][W][(C+15)/16*16], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][h][w][c].
+ //! Vector-minor format with 16 scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires FP16 and at least three dimensions.
kHWC16 = 11,
- //! Non-vectorized channel-last format. This format is bound to FP32.
- //! It is only available for dimensions >= 4.
+ //! Vector-minor format with one scalar per vector.
+ //! Vector dimension is fourth to last.
+ //!
+ //! This format requires FP32 and at least four dimensions.
kDHWC = 12
};
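// Editorial sketch (not part of the header): how a vector-major layout such as
// kCHW4 addresses an element. For an {N, C, H, W} tensor with V scalars per
// vector (V = 4 for kCHW4), coordinate (n, c, h, w) maps to element
// [n][c / V][h][w][c % V] of an [N][(C + V - 1) / V][H][W][V] array, so the
// linear offset can be computed as below. The function name is hypothetical.
#include <cstdint>

constexpr int64_t vectorMajorOffset(
    int64_t n, int64_t c, int64_t h, int64_t w, int64_t C, int64_t H, int64_t W, int64_t V) noexcept
{
    // Number of vectors along the (padded) channel dimension.
    int64_t const vectors = (C + V - 1) / V;
    return (((n * vectors + c / V) * H + h) * W + w) * V + c % V;
}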
+using InterfaceKind = char const*;
+
+//!
+//! \class InterfaceInfo
+//!
+//! \brief Version information associated with a TRT interface
+//!
+class InterfaceInfo
+{
+public:
+ InterfaceKind kind;
+ int32_t major;
+ int32_t minor;
+};
+
+//!
+//! \enum APILanguage
+//!
+//! \brief Programming language used in the implementation of a TRT interface
+//!
+enum class APILanguage : int32_t
+{
+ kCPP = 0,
+ kPYTHON = 1
+};
+
+namespace impl
+{
+//! Maximum number of elements in APILanguage enum. \see APILanguage
+template <>
+struct EnumMaxImpl<APILanguage>
+{
+ //! Declaration of kVALUE that represents the maximum number of elements in the APILanguage enum.
+ static constexpr int32_t kVALUE = 2;
+};
+} // namespace impl
+
+//!
+//! \class IVersionedInterface
+//!
+//! \brief An Interface class for version control.
+//!
+class IVersionedInterface
+{
+public:
+ //!
+ //! \brief The language used to build the implementation of this Interface.
+ //!
+ //! Applications must not override this method.
+ //!
+ virtual APILanguage getAPILanguage() const noexcept
+ {
+ return APILanguage::kCPP;
+ }
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ virtual InterfaceInfo getInterfaceInfo() const noexcept = 0;
+
+ virtual ~IVersionedInterface() noexcept = default;
+
+protected:
+ IVersionedInterface() = default;
+ IVersionedInterface(IVersionedInterface const&) = default;
+ IVersionedInterface(IVersionedInterface&&) = default;
+ IVersionedInterface& operator=(IVersionedInterface const&) & = default;
+ IVersionedInterface& operator=(IVersionedInterface&&) & = default;
+};
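// Editorial sketch (not part of the header): a user-defined versioned interface
// only needs to implement getInterfaceInfo(); the kind string and version are
// chosen by the implementer. The class name MyInterface is hypothetical.
class MyInterface : public nvinfer1::IVersionedInterface
{
public:
    nvinfer1::InterfaceInfo getInterfaceInfo() const noexcept override
    {
        return nvinfer1::InterfaceInfo{"MyInterface", 1, 0};
    }
};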
+
namespace impl
{
//! Maximum number of elements in TensorFormat enum. \see TensorFormat
template <>
struct EnumMaxImpl<TensorFormat>
{
- //! Declaration of kVALUE that represents maximum number of elements in TensorFormat enum
+ //! Declaration of kVALUE that represents the maximum number of elements in the TensorFormat enum.
static constexpr int32_t kVALUE = 13;
};
} // namespace impl
+
+//!
+//! \enum AllocatorFlag
+//!
+//! \brief Allowed type of memory allocation.
+//!
enum class AllocatorFlag : int32_t
{
- kRESIZABLE = 0, //!< TensorRT may call realloc() on this allocation
+ //! TensorRT may call realloc() on this allocation.
+ kRESIZABLE = 0,
};
namespace impl
@@ -357,72 +446,53 @@ namespace impl
template <>
struct EnumMaxImpl<AllocatorFlag>
{
- static constexpr int32_t kVALUE = 1; //!< maximum number of elements in AllocatorFlag enum
+ //! Declaration of kVALUE that represents the maximum number of elements in the AllocatorFlag enum.
+ static constexpr int32_t kVALUE = 1;
};
} // namespace impl
using AllocatorFlags = uint32_t;
-//!
-//! \class IGpuAllocator
-//!
-//! \brief Application-implemented class for controlling allocation on the GPU.
-//!
-class IGpuAllocator
+//! DO NOT REFER TO namespace v_1_0 IN CODE. ALWAYS USE nvinfer1 INSTEAD.
+//! The name v_1_0 may change in future versions of TensorRT.
+namespace v_1_0
+{
+
+class IGpuAllocator : public IVersionedInterface
{
public:
//!
- //! A thread-safe callback implemented by the application to handle acquisition of GPU memory.
+ //! \brief A thread-safe callback implemented by the application to handle acquisition of GPU memory.
//!
- //! \param size The size of the memory required.
+ //! \param size The size of the memory block required (in bytes).
//! \param alignment The required alignment of memory. Alignment will be zero
//! or a power of 2 not exceeding the alignment guaranteed by cudaMalloc.
//! Thus this allocator can be safely implemented with cudaMalloc/cudaFree.
//! An alignment value of zero indicates any alignment is acceptable.
//! \param flags Reserved for future use. In the current release, 0 will be passed.
//!
- //! If an allocation request of size 0 is made, nullptr should be returned.
- //!
- //! If an allocation request cannot be satisfied, nullptr should be returned.
+ //! \return If the allocation was successful, the start address of a device memory block of the requested size.
+ //! If an allocation request of size 0 is made, nullptr must be returned.
+ //! If an allocation request cannot be satisfied, nullptr must be returned.
+ //! If a non-null address is returned, it is guaranteed to have the specified alignment.
//!
- //! \note The implementation must guarantee thread safety for concurrent allocate/free/reallocate/deallocate
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
//! requests.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
//!
- virtual void* allocate(uint64_t const size, uint64_t const alignment, AllocatorFlags const flags) noexcept = 0;
-
- //!
- //! A thread-safe callback implemented by the application to handle release of GPU memory.
- //!
- //! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
- //!
- //! \param memory The acquired memory.
- //!
- //! \note The implementation must guarantee thread safety for concurrent allocate/free/reallocate/deallocate
- //! requests.
- //!
- //! \see deallocate()
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by allocateAsync
//!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by deallocate.
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
- //!
- TRT_DEPRECATED virtual void free(void* const memory) noexcept = 0;
+ TRT_DEPRECATED virtual void* allocate(
+ uint64_t const size, uint64_t const alignment, AllocatorFlags const flags) noexcept = 0;
- //!
- //! Destructor declared virtual as general good practice for a class with virtual methods.
- //! TensorRT never calls the destructor for an IGpuAllocator defined by the application.
- //!
- virtual ~IGpuAllocator() = default;
+ ~IGpuAllocator() override = default;
IGpuAllocator() = default;
//!
- //! A thread-safe callback implemented by the application to resize an existing allocation.
+ //! \brief A thread-safe callback implemented by the application to resize an existing allocation.
//!
//! Only allocations which were allocated with AllocatorFlag::kRESIZABLE will be resized.
//!
@@ -442,65 +512,161 @@ class IGpuAllocator
//!
//! TensorRT may call realloc to increase the buffer by relatively small amounts.
//!
- //! \param baseAddr the address of the original allocation.
- //! \param alignment The alignment used by the original allocation.
- //! \param newSize The new memory size required.
- //! \return the address of the reallocated memory
+ //! \param baseAddr the address of the original allocation, which will have been returned by previously calling
+ //! allocate() or reallocate() on the same object.
+ //! \param alignment The alignment used by the original allocation. This will be the same value that was previously
+ //! passed to the allocate() or reallocate() call that returned baseAddr.
+ //! \param newSize The new memory size required (in bytes).
+ //!
+ //! \return The address of the reallocated memory, or nullptr. If a non-null address is returned, it is
+ //! guaranteed to have the specified alignment.
//!
- //! \note The implementation must guarantee thread safety for concurrent allocate/free/reallocate/deallocate
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
//! requests.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
//!
- virtual void* reallocate(void* /*baseAddr*/, uint64_t /*alignment*/, uint64_t /*newSize*/) noexcept
+ virtual void* reallocate(void* const /*baseAddr*/, uint64_t /*alignment*/, uint64_t /*newSize*/) noexcept
{
return nullptr;
}
//!
- //! A thread-safe callback implemented by the application to handle release of GPU memory.
+ //! \brief A thread-safe callback implemented by the application to handle release of GPU memory.
//!
//! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
//!
- //! \param memory The acquired memory.
+ //! \param memory A memory address that was previously returned by an allocate() or reallocate() call of the same
+ //! allocator object.
+ //!
//! \return True if the acquired memory is released successfully.
//!
- //! \note The implementation must guarantee thread safety for concurrent allocate/free/reallocate/deallocate
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
//! requests.
//!
- //! \note If user-implemented free() might hit an error condition, the user should override deallocate() as the
- //! primary implementation and override free() to call deallocate() for backwards compatibility.
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by deallocateAsync
+ //!
+ TRT_DEPRECATED virtual bool deallocate(void* const memory) noexcept = 0;
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle stream-ordered acquisition of GPU memory.
+ //!
+ //! The default behavior is to call method allocate(), which is synchronous and thus loses
+ //! any performance benefits of asynchronous allocation. If you want the benefits of asynchronous
+ //! allocation, see discussion of IGpuAsyncAllocator vs. IGpuAllocator in the documentation
+ //! for nvinfer1::IGpuAllocator.
//!
- //! \see free()
+ //! \param size The size of the memory block required (in bytes).
+ //! \param alignment The required alignment of memory. Alignment will be zero
+ //! or a power of 2 not exceeding the alignment guaranteed by cudaMalloc.
+ //! Thus this allocator can be safely implemented with cudaMalloc/cudaFree.
+ //! An alignment value of zero indicates any alignment is acceptable.
+ //! \param flags Reserved for future use. In the current release, 0 will be passed.
+ //! \param stream specifies the cudaStream for asynchronous usage.
+ //!
+ //! \return If the allocation was successful, the start address of a device memory block of the requested size.
+ //! If an allocation request of size 0 is made, nullptr must be returned.
+ //! If an allocation request cannot be satisfied, nullptr must be returned.
+ //! If a non-null address is returned, it is guaranteed to have the specified alignment.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
+ //! requests.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
//!
- virtual bool deallocate(void* const memory) noexcept
+ virtual void* allocateAsync(
+ uint64_t const size, uint64_t const alignment, AllocatorFlags const flags, cudaStream_t /*stream*/) noexcept
{
- this->free(memory);
- return true;
+ return allocate(size, alignment, flags);
+ }
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle stream-ordered release of GPU memory.
+ //!
+ //! The default behavior is to call method deallocate(), which is synchronous and thus loses
+ //! any performance benefits of asynchronous deallocation. If you want the benefits of asynchronous
+ //! deallocation, see discussion of IGpuAsyncAllocator vs. IGpuAllocator in the documentation
+ //! for nvinfer1::IGpuAllocator.
+ //!
+ //! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
+ //!
+ //! \param memory A memory address that was previously returned by an allocate() or reallocate() call of the same
+ //! allocator object.
+ //! \param stream specifies the cudaStream for asynchronous usage.
+ //!
+ //! \return True if the acquired memory is released successfully.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
+ //! requests.
+ //!
+ //! \note The implementation is not required to be asynchronous. It is permitted to synchronize,
+ //! albeit doing so will lose the performance advantage of asynchronous deallocation.
+ //! Either way, it is critical that it not actually free the memory until the current
+ //! stream position is reached.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //!
+ virtual bool deallocateAsync(void* const memory, cudaStream_t /*stream*/) noexcept
+ {
+ return deallocate(memory);
+ }
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return {"IGpuAllocator", 1, 0};
}
protected:
-// @cond SuppressDoxyWarnings
+ // @cond SuppressDoxyWarnings
IGpuAllocator(IGpuAllocator const&) = default;
IGpuAllocator(IGpuAllocator&&) = default;
IGpuAllocator& operator=(IGpuAllocator const&) & = default;
IGpuAllocator& operator=(IGpuAllocator&&) & = default;
-// @endcond
+ // @endcond
};
+} // namespace v_1_0
+
+//!
+//! \class IGpuAllocator
+//!
+//! \brief Application-implemented class for controlling allocation on the GPU.
+//!
+//! \warning The lifetime of an IGpuAllocator object must exceed that of all objects that use it.
+//!
+//! This class is intended as a base class for allocators that implement synchronous allocation.
+//! If you want the benefits of asynchronous allocation, you can do either of:
+//!
+//! * Derive your class from IGpuAllocator and override all four of its virtual methods
+//! for allocation/deallocation, including the two deprecated methods.
+//!
+//! * Derive your class from IGpuAsyncAllocator and override its two pure virtual
+//! methods for allocation/deallocation.
+//!
+//! The latter style is preferred because it does not tie code to deprecated methods.
+//!
+//! \see IGpuAsyncAllocator.
+//!
+using IGpuAllocator = v_1_0::IGpuAllocator;
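// Editorial sketch (not part of the header), assuming the CUDA runtime headers
// are available: a minimal synchronous allocator that overrides only the two
// pure virtual methods, so allocateAsync()/deallocateAsync() fall back to their
// synchronous defaults. The alignment parameter can be ignored because cudaMalloc
// already provides the strongest alignment the interface can request. The class
// name SyncGpuAllocator is hypothetical.
#include <cuda_runtime_api.h>

class SyncGpuAllocator : public nvinfer1::IGpuAllocator
{
public:
    void* allocate(uint64_t const size, uint64_t const /*alignment*/,
        nvinfer1::AllocatorFlags const /*flags*/) noexcept override
    {
        if (size == 0)
        {
            return nullptr; // Size-0 requests must return nullptr.
        }
        void* ptr{nullptr};
        return (cudaMalloc(&ptr, size) == cudaSuccess) ? ptr : nullptr;
    }

    bool deallocate(void* const memory) noexcept override
    {
        return cudaFree(memory) == cudaSuccess;
    }
};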
+
//!
//! \class ILogger
//!
//! \brief Application-implemented logging interface for the builder, refitter and runtime.
//!
//! The logger used to create an instance of IBuilder, IRuntime or IRefitter is used for all objects created through
-//! that interface. The logger should be valid until all objects created are released.
+//! that interface. The logger must be valid until all objects created are released.
//!
//! The Logger object implementation must be thread safe. All locking and synchronization is pushed to the
//! interface implementation and TensorRT does not hold any synchronization primitives when calling the interface
@@ -512,7 +678,7 @@ class ILogger
//!
//! \enum Severity
//!
- //! The severity corresponding to a log message.
+ //! \brief The severity corresponding to a log message.
//!
enum class Severity : int32_t
{
@@ -529,11 +695,17 @@ class ILogger
};
//!
- //! A callback implemented by the application to handle logging messages;
+ //! \brief A callback implemented by the application to handle logging messages.
//!
//! \param severity The severity of the message.
//! \param msg A null-terminated log message.
//!
+ //! \warning Loggers used in the safety certified runtime must set a maximum message length and truncate
+ //! messages exceeding this length. It is up to the implementer of the derived class to define
+ //! a suitable limit that will prevent buffer overruns, resource exhaustion, and other security
+ //! vulnerabilities in their implementation. The TensorRT safety certified runtime will never
+ //! emit messages longer than 1024 bytes.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
@@ -560,7 +732,7 @@ namespace impl
template <>
struct EnumMaxImpl<ILogger::Severity>
{
- //! Declaration of kVALUE that represents maximum number of elements in ILogger::Severity enum
+ //! Declaration of kVALUE that represents the maximum number of elements in the ILogger::Severity enum.
static constexpr int32_t kVALUE = 5;
};
} // namespace impl
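// Editorial sketch (not part of the header): an ILogger that filters by severity
// and bounds the length of emitted messages, in the spirit of the warning above.
// The class name ConsoleLogger is hypothetical.
#include <cstdio>

class ConsoleLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override
    {
        if (severity > Severity::kWARNING)
        {
            return; // Drop kINFO and kVERBOSE messages.
        }
        // Print at most 1024 characters of the message.
        std::fprintf(stderr, "[TRT] %.1024s\n", msg);
    }
};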
@@ -617,8 +789,8 @@ enum class ErrorCode : int32_t
kFAILED_INITIALIZATION = 6,
//!
- //! An error occurred during execution that caused TensorRT to end prematurely, either an asynchronous error or
- //! other execution errors reported by CUDA/DLA. In a dynamic system, the
+ //! An error occurred during execution that caused TensorRT to end prematurely, either an asynchronous error,
+ //! user cancellation, or other execution errors reported by CUDA/DLA. In a dynamic system, the
//! data can be thrown away and the next frame can be processed or execution can be retried.
//! This is either an execution error or a memory error.
//!
@@ -649,7 +821,7 @@ enum class ErrorCode : int32_t
//!
//! An error occurred due to the network not being supported on the device due to constraints of the hardware or
- //! system. An example is running a unsafe layer in a safety certified context, or a resource requirement for the
+ //! system. An example is running an unsafe layer in a safety certified context, or a resource requirement for the
//! current network is greater than the capabilities of the target device. The network is otherwise correct, but
//! the network and hardware combination is problematic. This can be recoverable.
//! Examples:
@@ -672,49 +844,36 @@ struct EnumMaxImpl
};
} // namespace impl
-//!
-//! \class IErrorRecorder
-//!
-//! \brief Reference counted application-implemented error reporting interface for TensorRT objects.
-//!
-//! The error reporting mechanism is a user defined object that interacts with the internal state of the object
-//! that it is assigned to in order to determine information about abnormalities in execution. The error recorder
-//! gets both an error enum that is more descriptive than pass/fail and also a string description that gives more
-//! detail on the exact failure modes. In the safety context, the error strings are all limited to 1024 characters
-//! in length.
-//!
-//! The ErrorRecorder gets passed along to any class that is created from another class that has an ErrorRecorder
-//! assigned to it. For example, assigning an ErrorRecorder to an IBuilder allows all INetwork's, ILayer's, and
-//! ITensor's to use the same error recorder. For functions that have their own ErrorRecorder accessor functions.
-//! This allows registering a different error recorder or de-registering of the error recorder for that specific
-//! object.
-//!
-//! The ErrorRecorder object implementation must be thread safe. All locking and synchronization is pushed to the
-//! interface implementation and TensorRT does not hold any synchronization primitives when calling the interface
-//! functions.
-//!
-//! The lifetime of the ErrorRecorder object must exceed the lifetime of all TensorRT objects that use it.
-//!
-class IErrorRecorder
+namespace v_1_0
+{
+class IErrorRecorder : public IVersionedInterface
{
public:
//!
- //! A typedef of a C-style string for reporting error descriptions.
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IErrorRecorder", 1, 0};
+ }
+
+ //!
+ //! \brief A typedef of a C-style string for reporting error descriptions.
//!
using ErrorDesc = char const*;
//!
- //! The length limit for an error description, excluding the '\0' string terminator.
+ //! \brief The length limit for an error description in bytes, excluding the '\0' string terminator.
//!
static constexpr size_t kMAX_DESC_LENGTH{127U};
//!
- //! A typedef of a 32bit integer for reference counting.
+ //! \brief A typedef of a 32-bit integer for reference counting.
//!
using RefCount = int32_t;
IErrorRecorder() = default;
- virtual ~IErrorRecorder() noexcept = default;
+ ~IErrorRecorder() noexcept override = default;
// Public API used to retrieve information from the error recorder.
@@ -723,13 +882,18 @@ class IErrorRecorder
//!
//! Determines the number of errors that occurred between the current point in execution
//! and the last time that the clear() was executed. Due to the possibility of asynchronous
- //! errors occuring, a TensorRT API can return correct results, but still register errors
- //! with the Error Recorder. The value of getNbErrors must monotonically increases until clear()
- //! is called.
+ //! errors occurring, a TensorRT API can return correct results, but still register errors
+ //! with the Error Recorder. The value of getNbErrors() must increment by 1 after each reportError()
+ //! call until clear() is called, or the maximum number of errors that can be stored is exceeded.
//!
//! \return Returns the number of errors detected, or 0 if there are no errors.
+ //! If the upper bound of errors that can be stored is exceeded, the upper bound value must
+ //! be returned.
+ //!
+ //! For example, if the error recorder can store up to 16 error descriptions but reportError() has
+ //! been called 20 times, getNbErrors() must return 16.
//!
- //! \see clear
+ //! \see clear(), hasOverflowed()
//!
//! \usage
//! - Allowed context for the API call
@@ -746,9 +910,10 @@ class IErrorRecorder
//! The errorIdx specifies what error code from 0 to getNbErrors()-1 that the application
//! wants to analyze and return the error code enum.
//!
- //! \return Returns the enum corresponding to errorIdx.
+ //! \return Returns the enum corresponding to errorIdx if errorIdx is in range (between 0 and getNbErrors()-1).
+ //! ErrorCode::kUNSPECIFIED_ERROR must be returned if errorIdx is not in range.
//!
- //! \see getErrorDesc, ErrorCode
+ //! \see getErrorDesc(), ErrorCode
//!
//! \usage
//! - Allowed context for the API call
@@ -765,11 +930,13 @@ class IErrorRecorder
//! For the error specified by the idx value, return the string description of the error. The
//! error string is a null-terminated C-style string. In the safety context there is a
//! constant length requirement to remove any dynamic memory allocations and the error message
- //! may be truncated. The format of the string is " - ".
+ //! will be truncated if it exceeds kMAX_DESC_LENGTH bytes.
+ //! The format of the string is "<EnumAsStr> - <Description>".
//!
- //! \return Returns a string representation of the error along with a description of the error.
+ //! \return Returns a string representation of the error along with a description of the error if errorIdx is in
+ //! range (between 0 and getNbErrors()-1). An empty string will be returned if errorIdx is not in range.
//!
- //! \see getErrorCode
+ //! \see getErrorCode()
//!
//! \usage
//! - Allowed context for the API call
@@ -797,11 +964,11 @@ class IErrorRecorder
//!
//! \brief Clear the error stack on the error recorder.
//!
- //! Removes all the tracked errors by the error recorder. This function must guarantee that after
+ //! Removes all the errors tracked by the error recorder. The implementation must guarantee that after
//! this function is called, and as long as no error occurs, the next call to getNbErrors will return
- //! zero.
+ //! zero and hasOverflowed will return false.
//!
- //! \see getNbErrors
+ //! \see getNbErrors(), hasOverflowed()
//!
//! \usage
//! - Allowed context for the API call
@@ -816,7 +983,9 @@ class IErrorRecorder
//! \brief Report an error to the error recorder with the corresponding enum and description.
//!
//! \param val The error code enum that is being reported.
- //! \param desc The string description of the error.
+ //! \param desc The string description of the error, which will be a NULL-terminated string of kMAX_DESC_LENGTH
+ //! bytes or less (excluding the NULL terminator). Descriptions that exceed this limit will be silently
+ //! truncated.
//!
//! Report an error to the user that has a given value and human readable description. The function returns false
//! if processing can continue, which implies that the reported error is not fatal. This does not guarantee that
@@ -827,6 +996,10 @@ class IErrorRecorder
//!
//! \return True if the error is determined to be fatal and processing of the current function must end.
//!
+ //! \warning If the error recorder's maximum number of storable errors is exceeded, the error description will be
+ //! silently dropped and the value returned by getNbErrors() will not be incremented. However, the return
+ //! value will still signal whether the error must be considered fatal.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
@@ -839,9 +1012,9 @@ class IErrorRecorder
//!
//! Increments the reference count for the object by one and returns the current value. This reference count allows
//! the application to know that an object inside of TensorRT has taken a reference to the ErrorRecorder. TensorRT
- //! guarantees that every call to IErrorRecorder::incRefCount will be paired with a call to
- //! IErrorRecorder::decRefCount when the reference is released. It is undefined behavior to destruct the
- //! ErrorRecorder when incRefCount has been called without a corresponding decRefCount.
+ //! guarantees that every call to IErrorRecorder::incRefCount() will be paired with a call to
+ //! IErrorRecorder::decRefCount() when the reference is released. It is undefined behavior to destruct the
+ //! ErrorRecorder when incRefCount() has been called without a corresponding decRefCount().
//!
//! \return The reference counted value after the increment completes.
//!
@@ -857,9 +1030,9 @@ class IErrorRecorder
//!
//! Decrements the reference count for the object by one and returns the current value. This reference count allows
//! the application to know that an object inside of TensorRT has taken a reference to the ErrorRecorder. TensorRT
- //! guarantees that every call to IErrorRecorder::decRefCount will be preceded by a call to
- //! IErrorRecorder::incRefCount. It is undefined behavior to destruct the ErrorRecorder when incRefCount has been
- //! called without a corresponding decRefCount.
+ //! guarantees that every call to IErrorRecorder::decRefCount() will be preceded by a call to
+ //! IErrorRecorder::incRefCount(). It is undefined behavior to destruct the ErrorRecorder when incRefCount() has been
+ //! called without a corresponding decRefCount().
//!
//! \return The reference counted value after the decrement completes.
//!
@@ -878,6 +1051,36 @@ class IErrorRecorder
IErrorRecorder& operator=(IErrorRecorder&&) & = default;
// @endcond
}; // class IErrorRecorder
+} // namespace v_1_0
+
+//!
+//! \class IErrorRecorder
+//!
+//! \brief Reference counted application-implemented error reporting interface for TensorRT objects.
+//!
+//! The error reporting mechanism is a user-defined object that interacts with the internal state of the object
+//! that it is assigned to in order to determine information about abnormalities in execution. The error recorder
+//! gets both an error enum that is more descriptive than pass/fail and also a string description that gives more
+//! detail on the exact failure modes. In the safety context, the error strings are all limited to 128 bytes
+//! or less in length, including the NULL terminator.
+//!
+//! The ErrorRecorder gets passed along to any class that is created from another class that has an ErrorRecorder
+//! assigned to it. For example, assigning an ErrorRecorder to an IBuilder allows all INetwork's, ILayer's, and
+//! ITensor's to use the same error recorder. For functions that have their own ErrorRecorder accessor functions.
+//! This allows registering a different error recorder or de-registering of the error recorder for that specific
+//! object.
+//!
+//! ErrorRecorder objects that are used in the safety runtime must define an implementation-dependent upper limit
+//! of errors whose information can be stored, and drop errors above this upper limit. The limit must fit in int32_t.
+//! The IErrorRecorder::hasOverflowed() method is used to signal that one or more errors have been dropped.
+//!
+//! The ErrorRecorder object implementation must be thread safe. All locking and synchronization is pushed to the
+//! interface implementation and TensorRT does not hold any synchronization primitives when calling the interface
+//! functions.
+//!
+//! The lifetime of the ErrorRecorder object must exceed the lifetime of all TensorRT objects that use it.
+//!
+using IErrorRecorder = v_1_0::IErrorRecorder;
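// Editorial sketch (not part of the header): a fixed-capacity IErrorRecorder that
// follows the overflow behavior described above. The class name
// SimpleErrorRecorder and the capacity of 16 are illustrative choices.
#include <atomic>
#include <cstdint>
#include <cstring>
#include <mutex>

class SimpleErrorRecorder : public nvinfer1::IErrorRecorder
{
public:
    int32_t getNbErrors() const noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mNbErrors;
    }

    nvinfer1::ErrorCode getErrorCode(int32_t errorIdx) const noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return inRange(errorIdx) ? mCodes[errorIdx] : nvinfer1::ErrorCode::kUNSPECIFIED_ERROR;
    }

    ErrorDesc getErrorDesc(int32_t errorIdx) const noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return inRange(errorIdx) ? mDescs[errorIdx] : "";
    }

    bool hasOverflowed() const noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mOverflowed;
    }

    void clear() noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mNbErrors = 0;
        mOverflowed = false;
    }

    bool reportError(nvinfer1::ErrorCode val, ErrorDesc desc) noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mNbErrors < kCAPACITY)
        {
            mCodes[mNbErrors] = val;
            std::strncpy(mDescs[mNbErrors], desc, kMAX_DESC_LENGTH);
            mDescs[mNbErrors][kMAX_DESC_LENGTH] = '\0'; // Truncate over-long descriptions.
            ++mNbErrors;
        }
        else
        {
            mOverflowed = true; // The description is dropped; the count stays at the cap.
        }
        return false; // This sketch treats all reported errors as non-fatal.
    }

    RefCount incRefCount() noexcept override
    {
        return ++mRefCount;
    }

    RefCount decRefCount() noexcept override
    {
        return --mRefCount;
    }

private:
    static constexpr int32_t kCAPACITY{16};

    bool inRange(int32_t idx) const noexcept
    {
        return idx >= 0 && idx < mNbErrors;
    }

    mutable std::mutex mMutex;
    int32_t mNbErrors{0};
    bool mOverflowed{false};
    nvinfer1::ErrorCode mCodes[kCAPACITY]{};
    char mDescs[kCAPACITY][kMAX_DESC_LENGTH + 1]{};
    std::atomic<RefCount> mRefCount{0};
};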
//!
//! \enum TensorIOMode
@@ -896,6 +1099,116 @@ enum class TensorIOMode : int32_t
kOUTPUT = 2
};
+namespace v_1_0
+{
+class IStreamReader : public IVersionedInterface
+{
+public:
+ //!
+ //! TensorRT never calls the destructor for an IStreamReader defined by the
+ //! application.
+ //!
+ ~IStreamReader() override = default;
+ IStreamReader() = default;
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IStreamReader", 1, 0};
+ }
+
+ //!
+ //! \brief Read the next number of bytes in the stream.
+ //!
+ //! \param destination The memory to write to
+ //! \param nbBytes The number of bytes to read
+ //!
+ //! \returns The number of bytes read. Negative values will be considered an automatic error.
+ //!
+ virtual int64_t read(void* destination, int64_t nbBytes) = 0;
+
+protected:
+ IStreamReader(IStreamReader const&) = default;
+ IStreamReader(IStreamReader&&) = default;
+ IStreamReader& operator=(IStreamReader const&) & = default;
+ IStreamReader& operator=(IStreamReader&&) & = default;
+};
+} // namespace v_1_0
+
+//!
+//! \class IStreamReader
+//!
+//! \brief Application-implemented class for reading data in a stream-based manner.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IStreamReader, not
+//! v_1_0::IStreamReader
+//!
+using IStreamReader = v_1_0::IStreamReader;
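// Editorial sketch (not part of the header): an IStreamReader that feeds
// serialized data to TensorRT from a file. The class name FileStreamReader is
// hypothetical.
#include <fstream>

class FileStreamReader : public nvinfer1::IStreamReader
{
public:
    explicit FileStreamReader(char const* path)
        : mFile(path, std::ios::binary)
    {
    }

    int64_t read(void* destination, int64_t nbBytes) override
    {
        if (!mFile.good())
        {
            return -1; // A negative value signals an error to TensorRT.
        }
        mFile.read(static_cast<char*>(destination), nbBytes);
        return mFile.gcount(); // Number of bytes actually read.
    }

private:
    std::ifstream mFile;
};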
+
+namespace v_1_0
+{
+
+class IPluginResource : public IVersionedInterface
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IPluginResource", 1, 0};
+ }
+ //!
+ //! \brief Free the underlying resource
+ //!
+ //! This will only be called for IPluginResource objects that were produced from IPluginResource::clone()
+ //!
+ //! The IPluginResource object on which release() is called must still be in a clone-able state
+ //! after release() returns
+ //!
+ //! \return 0 for success, else non-zero
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: No; this method is not required to be thread-safe
+ //!
+ virtual int32_t release() noexcept = 0;
+
+ //!
+ //! \brief Clone the resource object
+ //!
+ //! \note Resource initialization (if any) may be skipped for non-cloned objects since only clones will be
+ //! registered by TensorRT
+ //!
+ //! \return Pointer to cloned object. nullptr if there was an issue.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes; this method is required to be thread-safe and may be called from multiple threads.
+ //!
+ virtual IPluginResource* clone() noexcept = 0;
+
+ ~IPluginResource() noexcept override = default;
+
+ IPluginResource() = default;
+ IPluginResource(IPluginResource const&) = default;
+ IPluginResource(IPluginResource&&) = default;
+ IPluginResource& operator=(IPluginResource const&) & = default;
+ IPluginResource& operator=(IPluginResource&&) & = default;
+}; // class IPluginResource
+} // namespace v_1_0
+
+//!
+//! \class IPluginResource
+//!
+//! \brief Interface for plugins to define custom resources that could be shared through the plugin registry
+//!
+//! \see IPluginRegistry::acquirePluginResource
+//! \see IPluginRegistry::releasePluginResource
+//!
+using IPluginResource = v_1_0::IPluginResource;
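// Editorial sketch (not part of the header): an IPluginResource owning a buffer
// that plugins can share through the registry. Per the contract above, release()
// frees the resource but leaves the object clone-able, and initialization is
// deferred to clones since only clones are registered. The class name
// BufferResource and its members are hypothetical.
#include <cstdint>
#include <new>

class BufferResource : public nvinfer1::IPluginResource
{
public:
    int32_t release() noexcept override
    {
        delete[] mData;  // Free the underlying resource...
        mData = nullptr; // ...while remaining clone-able.
        return 0;
    }

    BufferResource* clone() noexcept override
    {
        auto* cloned = new (std::nothrow) BufferResource();
        if (cloned != nullptr)
        {
            cloned->mData = new (std::nothrow) float[kSIZE]{};
            if (cloned->mData == nullptr)
            {
                delete cloned;
                cloned = nullptr; // Report cloning failure with nullptr.
            }
        }
        return cloned;
    }

    ~BufferResource() noexcept override
    {
        delete[] mData;
    }

private:
    static constexpr int32_t kSIZE{1024};
    float* mData{nullptr};
};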
+
namespace impl
{
//! Maximum number of elements in TensorIOMode enum. \see TensorIOMode
@@ -911,7 +1224,7 @@ struct EnumMaxImpl
//!
//! \brief Return the library version number.
//!
-//! The format is as for TENSORRT_VERSION: (TENSORRT_MAJOR * 1000) + (TENSORRT_MINOR * 100) + TENSOR_PATCH.
+//! The format is as for TENSORRT_VERSION: (MAJOR * 100 + MINOR) * 100 + PATCH
//!
extern "C" TENSORRTAPI int32_t getInferLibVersion() noexcept;
diff --git a/include/NvInferRuntimeCommon.h b/include/NvInferRuntimeCommon.h
index 9a317e65..65a3c220 100644
--- a/include/NvInferRuntimeCommon.h
+++ b/include/NvInferRuntimeCommon.h
@@ -22,9 +22,9 @@
//! \file NvInferRuntimeCommon.h
//!
//! This file provides the nvinfer1::IPluginRegistry interface, which will be moved to the NvInferRuntime.h header
-//! in TensorRT 9.0.
+//! in a future release.
//!
-//! \warning This file will be removed in TensorRT 9.0.
+//! \warning This file will be removed in a future release.
//!
//! \warning Do not directly include this file. Instead include NvInferRuntime.h
//!
@@ -50,15 +50,17 @@ namespace nvinfer1
//! \warning In the automotive safety context, be sure to call IPluginRegistry::setErrorRecorder() to register
//! an error recorder with the registry before using other methods in the registry.
//!
-
class IPluginRegistry
{
public:
- //! Pointer for plugin library handle.
+ //!
+ //! \brief Pointer for plugin library handle.
+ //!
using PluginLibraryHandle = void*;
+
//!
- //! \brief Register a plugin creator. Returns false if one with same type
- //! is already registered.
+ //! \brief Register a plugin creator implementing IPluginCreator. Returns false if any plugin creator with the same
+ //! name, version, and namespace is already registered.
//!
//! \warning The string pluginNamespace must be 1024 bytes or less including the NULL terminator and must be NULL
//! terminated.
@@ -67,17 +69,26 @@ class IPluginRegistry
//! - Allowed context for the API call
//! - Thread-safe: Yes; calls to this method will be synchronized by a mutex.
//!
- virtual bool registerCreator(IPluginCreator& creator, AsciiChar const* const pluginNamespace) noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by
+ //! IPluginRegistry::registerCreator(IPluginCreatorInterface&, AsciiChar const* const).
+ //!
+ TRT_DEPRECATED virtual bool registerCreator(
+ IPluginCreator& creator, AsciiChar const* const pluginNamespace) noexcept = 0;
//!
//! \brief Return all the registered plugin creators and the number of
//! registered plugin creators. Returns nullptr if none found.
//!
+ //! \warning If any plugin creators are registered or deregistered after calling this function, the returned pointer
+ //! is not guaranteed to be valid thereafter.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual IPluginCreator* const* getPluginCreatorList(int32_t* const numCreators) const noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by IPluginRegistry::getAllCreators(int32_t* const).
+ //!
+ TRT_DEPRECATED virtual IPluginCreator* const* getPluginCreatorList(int32_t* const numCreators) const noexcept = 0;
//!
//! \brief Return plugin creator based on plugin name, version, and
@@ -86,13 +97,18 @@ class IPluginRegistry
//! \warning The strings pluginName, pluginVersion, and pluginNamespace must be 1024 bytes or less including the
//! NULL terminator and must be NULL terminated.
//!
+ //! \warning Returns nullptr if a plugin creator with matching name, version, and namespace is found, but is not a
+ //! descendant of IPluginCreator
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual IPluginCreator* getPluginCreator(AsciiChar const* const pluginName, AsciiChar const* const pluginVersion,
- AsciiChar const* const pluginNamespace = "") noexcept
- = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by IPluginRegistry::getCreator(AsciiChar const* const,
+ //! AsciiChar const* const, AsciiChar const* const).
+ //!
+ TRT_DEPRECATED virtual IPluginCreator* getPluginCreator(AsciiChar const* const pluginName,
+ AsciiChar const* const pluginVersion, AsciiChar const* const pluginNamespace = "") noexcept = 0;
// @cond SuppressDoxyWarnings
IPluginRegistry() = default;
@@ -100,7 +116,7 @@ class IPluginRegistry
IPluginRegistry(IPluginRegistry&&) = delete;
IPluginRegistry& operator=(IPluginRegistry const&) & = delete;
IPluginRegistry& operator=(IPluginRegistry&&) & = delete;
-// @endcond
+ // @endcond
protected:
virtual ~IPluginRegistry() noexcept = default;
@@ -115,7 +131,7 @@ class IPluginRegistry
//! a recorder has been registered.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
//! \usage
@@ -142,21 +158,23 @@ class IPluginRegistry
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
//!
- //! \brief Deregister a previously registered plugin creator.
+ //! \brief Deregister a previously registered plugin creator implementing IPluginCreator.
//!
//! Since there may be a desire to limit the number of plugins,
//! this function provides a mechanism for removing plugin creators registered in TensorRT.
//! The plugin creator that is specified by \p creator is removed from TensorRT and no longer tracked.
//!
//! \return True if the plugin creator was deregistered, false if it was not found in the registry or otherwise
- //! could
- //! not be deregistered.
+ //! could not be deregistered.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual bool deregisterCreator(IPluginCreator const& creator) noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by
+ //! IPluginRegistry::deregisterCreator(IPluginCreatorInterface const&).
+ //!
+ TRT_DEPRECATED virtual bool deregisterCreator(IPluginCreator const& creator) noexcept = 0;
//!
//! \brief Return whether the parent registry will be searched if a plugin is not found in this registry
@@ -194,6 +212,90 @@ class IPluginRegistry
//! \param handle the plugin library handle to deregister.
//!
virtual void deregisterLibrary(PluginLibraryHandle handle) noexcept = 0;
+
+ //!
+ //! \brief Register a plugin creator. Returns false if a plugin creator with the same type
+ //! is already registered.
+ //!
+ //! \warning The string pluginNamespace must be 1024 bytes or less including the NULL terminator and must be NULL
+ //! terminated.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes; calls to this method will be synchronized by a mutex.
+ //!
+ virtual bool registerCreator(IPluginCreatorInterface& creator, AsciiChar const* const pluginNamespace) noexcept = 0;
+
+ //!
+ //! \brief Return all registered plugin creators. Returns nullptr if none found.
+ //!
+ //! \warning If any plugin creators are registered or deregistered after calling this function, the returned pointer
+ //! is not guaranteed to be valid thereafter.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: No
+ //!
+ virtual IPluginCreatorInterface* const* getAllCreators(int32_t* const numCreators) const noexcept = 0;
+
+ //!
+ //! \brief Return a registered plugin creator based on plugin name, version, and namespace associated with the
+ //! plugin during network creation.
+ //!
+ //! \warning The strings pluginName, pluginVersion, and pluginNamespace must be 1024 bytes or less including the
+ //! NULL terminator and must be NULL terminated.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes
+ //!
+ virtual IPluginCreatorInterface* getCreator(AsciiChar const* const pluginName, AsciiChar const* const pluginVersion,
+ AsciiChar const* const pluginNamespace = "") noexcept = 0;
+
+ //!
+ //! \brief Deregister a previously registered plugin creator.
+ //!
+ //! Since there may be a desire to limit the number of plugins,
+ //! this function provides a mechanism for removing plugin creators registered in TensorRT.
+ //! The plugin creator that is specified by \p creator is removed from TensorRT and no longer tracked.
+ //!
+ //! \return True if the plugin creator was deregistered, false if it was not found in the registry or otherwise
+ //! could not be deregistered.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes
+ //!
+ virtual bool deregisterCreator(IPluginCreatorInterface const& creator) noexcept = 0;
+
+ //!
+ //! \brief Get a plugin resource
+ //! \param key Key for identifying the resource. Cannot be null.
+ //! \param resource A plugin resource object. The object will only need to be valid until this method returns, as
+ //! only a clone of this object will be registered by TRT. Cannot be null.
+ //!
+ //! \return Registered plugin resource object
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes; calls to this method will be synchronized by a mutex.
+ //!
+ virtual IPluginResource* acquirePluginResource(AsciiChar const* key, IPluginResource* resource) noexcept = 0;
+
+ //!
+ //! \brief Decrement reference count for the resource with this key
+ //! If reference count goes to zero after decrement, release() will be invoked on the resource, the key will
+ //! be deregistered and the resource object will be deleted
+ //!
+ //! \param key Key that was used to register the resource. Cannot be null.
+ //!
+ //! \return 0 for success, else non-zero
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes; calls to this method will be synchronized by a mutex.
+ //!
+ virtual int32_t releasePluginResource(AsciiChar const* key) noexcept = 0;
};
} // namespace nvinfer1
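// Editorial usage sketch (not part of the diff): enumerating plugin creators
// through the new getAllCreators() method on the global registry, obtained via
// TensorRT's getPluginRegistry() accessor (assuming NvInferRuntime.h is
// included). The function name listCreators is hypothetical.
void listCreators()
{
    nvinfer1::IPluginRegistry* const registry = getPluginRegistry();
    int32_t numCreators{0};
    nvinfer1::IPluginCreatorInterface* const* const creators = registry->getAllCreators(&numCreators);
    for (int32_t i = 0; i < numCreators; ++i)
    {
        // Inspect creators[i]; the returned list is only guaranteed valid until
        // the registry is modified again.
        static_cast<void>(creators[i]);
    }
}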
diff --git a/include/NvInferRuntimePlugin.h b/include/NvInferRuntimePlugin.h
index 208d4e88..ecae2ce9 100644
--- a/include/NvInferRuntimePlugin.h
+++ b/include/NvInferRuntimePlugin.h
@@ -45,12 +45,18 @@ namespace nvinfer1
//!
using PluginFormat = TensorFormat;
+//!
+//! \brief Bit at the plugin version to identify that it is a plugin.
+//!
+static constexpr int32_t kPLUGIN_VERSION_PYTHON_BIT = 0x40;
+
+//!
//! \struct PluginTensorDesc
//!
//! \brief Fields that a plugin might see for an input or output.
//!
//! Scale is only valid when data type is DataType::kINT8. TensorRT will set
-//! the value to -1.0f if it is invalid.
+//! the value to -1.0F if it is invalid.
//!
//! \see IPluginV2IOExt::supportsFormatCombination
//! \see IPluginV2IOExt::configurePlugin
@@ -67,6 +73,7 @@ struct PluginTensorDesc
float scale;
};
+//!
//! \struct PluginVersion
//!
//! \brief Definition of plugin versions.
@@ -83,8 +90,24 @@ enum class PluginVersion : uint8_t
kV2_IOEXT = 2,
//! IPluginV2DynamicExt
kV2_DYNAMICEXT = 3,
+ //! IPluginV2DynamicExt-based Python plugins
+ kV2_DYNAMICEXT_PYTHON = kPLUGIN_VERSION_PYTHON_BIT | 3
+};
+
+//!
+//! \enum PluginCreatorVersion
+//!
+//! \brief Enum to identify version of the plugin creator.
+//!
+enum class PluginCreatorVersion : int32_t
+{
+ //! IPluginCreator
+ kV1 = 0,
+ //! IPluginCreator-based Python plugin creators
+ kV1_PYTHON = kPLUGIN_VERSION_PYTHON_BIT
};
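// Editorial sketch (not part of the header): testing the Python bit defined above
// against a PluginVersion value. The helper name isPythonPlugin is hypothetical.
constexpr bool isPythonPlugin(nvinfer1::PluginVersion const version) noexcept
{
    return (static_cast<int32_t>(version) & nvinfer1::kPLUGIN_VERSION_PYTHON_BIT) != 0;
}

static_assert(isPythonPlugin(nvinfer1::PluginVersion::kV2_DYNAMICEXT_PYTHON), "Python bit set");
static_assert(!isPythonPlugin(nvinfer1::PluginVersion::kV2_DYNAMICEXT), "Python bit clear");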
+//!
//! \class IPluginV2
//!
//! \brief Plugin class for user-implemented layers.
@@ -108,6 +131,8 @@ class TRT_DEPRECATED IPluginV2
//! Do not override this method as it is used by the TensorRT library to maintain backwards-compatibility with
//! plugins.
//!
+ //! \return The TensorRT version in the format (major * 100 + minor) * 100 + patch.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, the implementation provided here is safe to call from any thread.
@@ -119,10 +144,11 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Return the plugin type. Should match the plugin name returned by the corresponding plugin creator
+ //!
//! \see IPluginCreator::getPluginName()
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -133,10 +159,11 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Return the plugin version. Should match the plugin version returned by the corresponding plugin creator
+ //!
//! \see IPluginCreator::getPluginVersion()
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -148,7 +175,7 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Get the number of outputs from the layer.
//!
- //! \return The number of outputs.
+ //! \return The number of outputs, which is a positive integer.
//!
//! This function is called by the implementations of INetworkDefinition and IBuilder. In particular, it is called
//! prior to any call to initialize().
@@ -163,9 +190,13 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Get the dimension of an output tensor.
//!
- //! \param index The index of the output tensor.
- //! \param inputs The input tensors.
- //! \param nbInputDims The number of input tensors.
+ //! \param index The index of the output tensor. Will lie in the valid range (between 0 and getNbOutputs()-1
+ //! inclusive).
+ //! \param inputs The input tensor dimensions. Will be the start address of a Dims array of length nbInputDims.
+ //! \param nbInputDims The number of input tensors. Will be a non-negative integer.
+ //!
+ //! \return The output tensor dimensions if the index is in the valid range.
+ //! An invalid value of Dims{-1, {}} must be returned if the index is not in the valid range.
//!
//! This function is called by the implementations of INetworkDefinition and IBuilder. In particular, it is called
//! prior to any call to initialize().
@@ -175,7 +206,7 @@ class TRT_DEPRECATED IPluginV2
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
//! when building networks on multiple devices sharing the same plugin.
//!
- //! \note In any non-IPluginV2DynamicExt plugin, batch size should not be included in the returned dimensions,
+ //! \note In any non-IPluginV2DynamicExt plugin, batch size must not be included in the returned dimensions,
//! even if the plugin is expected to be run in a network with explicit batch mode enabled.
//! Please see the TensorRT Developer Guide for more details on how plugin inputs and outputs behave.
//!
@@ -186,6 +217,7 @@ class TRT_DEPRECATED IPluginV2
//!
//! \param type DataType requested.
//! \param format PluginFormat requested.
+ //!
//! \return true if the plugin supports the type-format combination.
//!
//! This function is called by the implementations of INetworkDefinition, IBuilder, and
@@ -211,13 +243,14 @@ class TRT_DEPRECATED IPluginV2
//! This function is called by the builder prior to initialize(). It provides an opportunity for the layer to make
//! algorithm choices on the basis of its weights, dimensions, and maximum batch size.
//!
- //! \param inputDims The input tensor dimensions.
- //! \param nbInputs The number of inputs.
- //! \param outputDims The output tensor dimensions.
- //! \param nbOutputs The number of outputs.
+ //! \param inputDims The input tensor dimensions. Will be the start address of a Dims array of length nbInputs.
+ //! \param nbInputs The number of inputs. Will be a non-negative integer.
+ //! \param outputDims The output tensor dimensions. Will be the start address of a Dims array of length nbOutputs.
+ //! \param nbOutputs The number of outputs. Will be a positive integer identical to the return value of
+ //! getNbOutputs().
//! \param type The data type selected for the engine.
//! \param format The format selected for the engine.
- //! \param maxBatchSize The maximum batch size.
+ //! \param maxBatchSize The maximum batch size. Will be a positive integer.
//!
//! The dimensions passed here do not include the outermost batch size (i.e. for 2-D image networks, they will be
//! 3-dimensional CHW dimensions).
@@ -256,6 +289,7 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Release resources acquired during plugin layer initialization. This is called when the engine is
//! destroyed.
+ //!
//! \see initialize()
//!
//! \usage
@@ -270,10 +304,13 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Find the workspace size required by the layer.
//!
- //! This function is called during engine startup, after initialize(). The workspace size returned should be
+ //! This function is called during engine startup, after initialize(). The workspace size returned must be
//! sufficient for any batch size up to the maximum.
//!
- //! \return The workspace size.
+ //! \param maxBatchSize The maximum batch size, which will be a positive integer.
+ //!
+ //! \return The workspace size in bytes, i.e. the device memory size that the plugin requires for its internal
+ //! computations.
//!
//! \usage
//! - Allowed context for the API call
@@ -287,10 +324,15 @@ class TRT_DEPRECATED IPluginV2
//! \brief Execute the layer.
//!
//! \param batchSize The number of inputs in the batch.
- //! \param inputs The memory for the input tensors.
- //! \param outputs The memory for the output tensors.
- //! \param workspace Workspace for execution.
- //! \param stream The stream in which to execute the kernels.
+ //! \param inputs The memory for the input tensors. Will be an array of device addresses corresponding to input
+ //! tensors of length nbInputs, where nbInputs is the second parameter passed to configureWithFormat().
+ //! The i-th input tensor will have the dimensions inputDims[i], where inputDims is the first parameter
+ //! that was passed to configureWithFormat().
+ //! \param outputs The memory for the output tensors. Will be an array of device addresses corresponding to output
+ //! tensors of length getNbOutputs().
+ //! \param workspace Workspace for execution. Will be the start address of a device buffer whose length will be at
+ //! least getWorkspaceSize(batchSize).
+ //! \param stream The stream in which to execute the kernels. This will be a valid CUDA stream.
//!
//! \return 0 for success, else non-zero (which will cause engine termination).
//!
@@ -304,9 +346,9 @@ class TRT_DEPRECATED IPluginV2
= 0;
//!
- //! \brief Find the size of the serialization buffer required.
+ //! \brief Find the size of the serialization buffer required to store the plugin configuration in a binary file.
//!
- //! \return The size of the serialization buffer.
+ //! \return The size of the serialization buffer in bytes.
//!
//! \usage
//! - Allowed context for the API call
@@ -318,8 +360,8 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Serialize the layer.
//!
- //! \param buffer A pointer to a buffer to serialize data. Size of buffer must be equal to value returned by
- //! getSerializationSize.
+ //! \param buffer A pointer to a host buffer to serialize data. Size of buffer will be at least as large as the
+ //! value returned by getSerializationSize.
//!
//! \see getSerializationSize()
//!
@@ -346,7 +388,10 @@ class TRT_DEPRECATED IPluginV2
//!
//! The TensorRT runtime calls clone() to clone the plugin when an execution context is created for an engine,
//! after the engine has been created. The runtime does not call initialize() on the cloned plugin,
- //! so the cloned plugin should be created in an initialized state.
+ //! so the cloned plugin must be created in an initialized state.
+ //!
+ //! \return A cloned plugin object in an initialized state with the same parameters as the current object.
+ //! nullptr must be returned if the cloning fails, e.g. because of resource exhaustion.
//!
//! \usage
//! - Allowed context for the API call
@@ -358,12 +403,12 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Set the namespace that this plugin object belongs to. Ideally, all plugin
- //! objects from the same plugin library should have the same namespace.
+ //! objects from the same plugin library must have the same namespace.
//!
//! \param pluginNamespace The namespace for the plugin object.
//!
- //! \warning The string pluginNamespace must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string pluginNamespace will be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -375,6 +420,9 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Return the namespace of the plugin object.
//!
+ //! \return The namespace string that was passed to setPluginNamespace(), possibly after truncation to 1024 bytes
+ //! if a longer string was passed. An empty string must be returned as default value.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
@@ -396,13 +444,14 @@ class TRT_DEPRECATED IPluginV2
// @endcond
};
+//!
//! \class IPluginV2Ext
//!
//! \brief Plugin class for user-implemented layers.
//!
//! Plugins are a mechanism for applications to implement custom layers. This
//! interface provides additional capabilities to the IPluginV2 interface by
-//! supporting different output data types and broadcast across batch.
+//! supporting different output data types and broadcast across batches.
//!
//! \see IPluginV2
//!
@@ -415,7 +464,15 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \brief Return the DataType of the plugin output at the requested index.
//!
- //! The default behavior should be to return the type of the first input, or DataType::kFLOAT if the layer has no
+ //! \param index The output tensor index in the valid range between 0 and getNbOutputs()-1.
+ //! \param inputTypes The data types of the input tensors, stored in an array of length nbInputs.
+ //! \param nbInputs The number of input tensors. Will be a non-negative integer.
+ //!
+ //! \return The data type of the output tensor with the provided index if the input tensors have the data types
+ //! provided in inputTypes, provided the output tensor index is in the valid range. DataType::kFLOAT must be
+ //! returned if the index is not in the valid range.
+ //!
+ //! The default behavior must be to return the type of the first input, or DataType::kFLOAT if the layer has no
//! inputs. The returned data type must have a format that is supported by the plugin.
//!
//! \see supportsFormat()
@@ -431,11 +488,14 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
int32_t index, nvinfer1::DataType const* inputTypes, int32_t nbInputs) const noexcept
= 0;
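A hedged sketch of the documented default behavior, for a hypothetical class derived from IPluginV2Ext (`MyPluginV2Ext` is an illustrative name):

```cpp
nvinfer1::DataType MyPluginV2Ext::getOutputDataType(
    int32_t index, nvinfer1::DataType const* inputTypes, int32_t nbInputs) const noexcept
{
    // Out-of-range indices must map to DataType::kFLOAT per the contract above.
    if (index < 0 || index >= getNbOutputs())
    {
        return nvinfer1::DataType::kFLOAT;
    }
    // Default behavior: the type of the first input, or kFLOAT if there are no inputs.
    return (nbInputs > 0) ? inputTypes[0] : nvinfer1::DataType::kFLOAT;
}
```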
- //! \brief Return true if output tensor is broadcast across a batch.
//!
- //! \param outputIndex The index of the output
- //! \param inputIsBroadcasted The ith element is true if the tensor for the ith input is broadcast across a batch.
- //! \param nbInputs The number of inputs
+ //! \brief Return true if the output tensor is broadcast across a batch.
+ //!
+ //! \param outputIndex The index of the output tensor, which will be in the valid range between 0 and
+ //! getNbOutputs()-1.
+ //! \param inputIsBroadcasted A boolean array of length nbInputs. The i-th element will be true if and only if
+ //! the tensor for the i-th input is broadcast across a batch.
+ //! \param nbInputs The number of inputs. Will be a non-negative integer.
//!
//! The values in inputIsBroadcasted refer to broadcasting at the semantic level,
//! i.e. are unaffected by whether method canBroadcastInputAcrossBatch requests
@@ -446,18 +506,25 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
//! when building networks on multiple devices sharing the same plugin.
//!
- virtual bool isOutputBroadcastAcrossBatch(
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED virtual bool isOutputBroadcastAcrossBatch(
int32_t outputIndex, bool const* inputIsBroadcasted, int32_t nbInputs) const noexcept
= 0;
- //! \brief Return true if plugin can use input that is broadcast across batch without replication.
//!
- //! \param inputIndex Index of input that could be broadcast.
+ //! \brief Return true if the plugin can use an input tensor that is broadcast across batch without replication.
+ //!
+ //! \param inputIndex Index of input that could be broadcast. Will be in the valid range between 0 and
+ //! nbInputs - 1 where nbInputs is the maximum number of input tensors supported by this plugin.
+ //!
+ //! \return True if the index is in the valid range and the plugin is able to broadcast a single copy of this
+ //! input tensor across the batch, false otherwise.
//!
//! For each input whose tensor is semantically broadcast across a batch,
//! TensorRT calls this method before calling configurePlugin.
//! If canBroadcastInputAcrossBatch returns true, TensorRT will not replicate the input tensor;
- //! i.e., there will be a single copy that the plugin should share across the batch.
+ //! i.e., there will be a single copy that the plugin must share across the batch.
//! If it returns false, TensorRT will replicate the input tensor
//! so that it appears like a non-broadcasted tensor.
//!
@@ -468,7 +535,9 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
//! when building networks on multiple devices sharing the same plugin.
//!
- virtual bool canBroadcastInputAcrossBatch(int32_t inputIndex) const noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED virtual bool canBroadcastInputAcrossBatch(int32_t inputIndex) const noexcept = 0;
//!
//! \brief Configure the layer with input and output data types.
@@ -476,20 +545,22 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! This function is called by the builder prior to initialize(). It provides an opportunity for the layer to make
//! algorithm choices on the basis of its weights, dimensions, data types and maximum batch size.
//!
- //! \param inputDims The input tensor dimensions.
- //! \param nbInputs The number of inputs.
- //! \param outputDims The output tensor dimensions.
- //! \param nbOutputs The number of outputs.
- //! \param inputTypes The data types selected for the plugin inputs.
- //! \param outputTypes The data types selected for the plugin outputs.
+ //! \param inputDims The input tensor dimensions. Will be an array of length nbInputs.
+ //! \param nbInputs The number of inputs. Will be a non-negative integer.
+ //! \param outputDims The output tensor dimensions. Will be an array of length nbOutputs.
+ //! \param nbOutputs The number of outputs. Will be a positive integer.
+ //! \param inputTypes The data types selected for the plugin inputs. Will be an array of length nbInputs.
+ //! \param outputTypes The data types selected for the plugin outputs. Will be an array of length nbOutputs.
//! \param inputIsBroadcast True for each input that the plugin must broadcast across the batch.
+ //! Will be an array of length nbInputs.
//! \param outputIsBroadcast True for each output that TensorRT will broadcast across the batch.
+ //! Will be an array of length nbOutputs.
//! \param floatFormat The format selected for the engine for the floating point inputs/outputs.
- //! \param maxBatchSize The maximum batch size.
+ //! \param maxBatchSize The maximum batch size. Will be a positive integer.
//!
//! The dimensions passed here do not include the outermost batch size (i.e. for 2-D image networks, they will be
//! 3-dimensional CHW dimensions). When inputIsBroadcast or outputIsBroadcast is true, the outermost batch size for
- //! that input or output should be treated as if it is one.
+ //! that input or output must be treated as if it is one.
//! Index 'i' of inputIsBroadcast is true only if the input is semantically broadcast across the batch and
//! calling canBroadcastInputAcrossBatch with argument 'i' returns true.
//! Index 'i' of outputIsBroadcast is true only if calling isOutputBroadcastAcrossBatch with argument 'i'
@@ -515,10 +586,12 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \brief Attach the plugin object to an execution context and grant the plugin the access to some context
- //! resource.
+ //! resources.
//!
- //! \param cudnn The CUDNN context handle of the execution context
- //! \param cublas The cublas context handle of the execution context
+ //! \param cudnn The cuDNN context handle of the execution context. Will be a valid cuDNN context handle, or
+ //! nullptr if TacticSource::kCUDNN is disabled.
+ //! \param cublas The cuBLAS context handle of the execution context. Will be a valid cuBLAS context handle, or
+ //! nullptr if TacticSource::kCUBLAS is disabled.
//! \param allocator The allocator used by the execution context
//!
//! This function is called automatically for each plugin when a new execution context is created. If the context
@@ -526,10 +599,19 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! new resources are assigned to the context.
//!
//! If the plugin needs per-context resource, it can be allocated here.
- //! The plugin can also get context-owned CUDNN and CUBLAS context here.
+ //! The plugin can also get context-owned cuDNN and cuBLAS context here.
+ //!
+ //! \note The TacticSource::kCUDNN and TacticSource::kCUBLAS flags are disabled by default.
+ //! The allocator pointer is unique to each building or execution context instance having overlapping lifetimes.
+ //! It can be used as a key to manage resources across plugin instances sharing the same context.
+ //! Plugins attached to different contexts will have different handles as their execution will not overlap.
+ //!
+ //! \see TacticSources
+ //! \see getPluginCudnnHandle(void* executionContextIdentifier)
+ //! \see getPluginCublasHandle(void* executionContextIdentifier)
//!
- //! \note In the automotive safety context, the CUDNN and CUBLAS parameters will be nullptr because CUDNN and CUBLAS
- //! is not used by the safe runtime.
+ //! \note In the automotive safety context, the cuDNN and cuBLAS parameters will be nullptr because cuDNN and cuBLAS
+ //! are not used by the safe runtime.
//!
//! \usage
//! - Allowed context for the API call
@@ -544,7 +626,7 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \brief Detach the plugin object from its execution context.
//!
- //! This function is called automatically for each plugin when a execution context is destroyed or the context
+ //! This function is called automatically for each plugin when an execution context is destroyed or the context
//! resources are unassigned from the context.
//!
//! If the plugin owns per-context resource, it can be released here.
@@ -559,10 +641,12 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \brief Clone the plugin object. This copies over internal plugin parameters as well and returns a new plugin
//! object with these parameters. If the source plugin is pre-configured with configurePlugin(), the returned object
- //! should also be pre-configured. The returned object should allow attachToContext() with a new execution context.
+ //! must also be pre-configured. The returned object must allow attachToContext() with a new execution context.
//! Cloned plugin objects can share the same per-engine immutable resource (e.g. weights) with the source object
//! (e.g. via ref-counting) to avoid duplication.
//!
+ //! \return A pointer to a cloned plugin object if cloning was successful, otherwise nullptr.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
@@ -582,6 +666,10 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! \brief Return the API version with which this plugin was built. The
//! upper byte reserved by TensorRT and is used to differentiate this from IPluginV2.
//!
+ //! \return In the lower three bytes, the TensorRT version in the format
+ //! (major * 100 + minor) * 100 + patch.
+ //! In the upper byte, the value 1.
+ //!
//! Do not override this method as it is used by the TensorRT library to maintain backwards-compatibility with
//! plugins.
//!
@@ -596,7 +684,10 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
}
//!
- //! \brief Derived classes should not implement this. In a C++11 API it would be override final.
+ //! \brief Derived classes must not implement this. In a C++11 API it would be override final.
+ //!
+ //! IPluginV2Ext::configureWithFormat() is a no-op for all classes derived from IPluginV2Ext.
+ //! These classes call configurePlugin() instead.
//!
void configureWithFormat(Dims const* /*inputDims*/, int32_t /*nbInputs*/, Dims const* /*outputDims*/,
int32_t /*nbOutputs*/, DataType /*type*/, PluginFormat /*format*/, int32_t /*maxBatchSize*/) noexcept override
@@ -604,6 +695,7 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
}
};
+//!
//! \class IPluginV2IOExt
//!
//! \brief Plugin class for user-implemented layers.
@@ -613,7 +705,9 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \see IPluginV2Ext
//!
-class IPluginV2IOExt : public IPluginV2Ext
+//! \deprecated Deprecated in TensorRT 10.0.
+//!
+class TRT_DEPRECATED IPluginV2IOExt : public IPluginV2Ext
{
public:
//!
@@ -644,10 +738,10 @@ class IPluginV2IOExt : public IPluginV2Ext
//! Using this numbering, pos is an index into InOut, where 0 <= pos < nbInputs+nbOutputs.
//!
//! TensorRT invokes this method to ask if the input/output indexed by pos supports the format/datatype specified
- //! by inOut[pos].format and inOut[pos].type. The override should return true if that format/datatype at inOut[pos]
+ //! by inOut[pos].format and inOut[pos].type. The override must return true if that format/datatype at inOut[pos]
//! are supported by the plugin. If support is conditional on other input/output formats/datatypes, the plugin can
//! make its result conditional on the formats/datatypes in inOut[0..pos-1], which will be set to values
- //! that the plugin supports. The override should not inspect inOut[pos+1..nbInputs+nbOutputs-1],
+ //! that the plugin supports. The override must not inspect inOut[pos+1..nbInputs+nbOutputs-1],
//! which will have invalid values. In other words, the decision for pos must be based on inOut[0..pos] only.
//!
//! Some examples:
@@ -711,11 +805,17 @@ class IPluginV2IOExt : public IPluginV2Ext
private:
// Following are obsolete base class methods, and must not be implemented or used.
+ //!
+ //! \brief Set plugin configuration.
+ //!
void configurePlugin(Dims const*, int32_t, Dims const*, int32_t, DataType const*, DataType const*, bool const*,
bool const*, PluginFormat, int32_t) noexcept final
{
}
+ //!
+ //! \brief Check if provided data type is supported.
+ //!
bool supportsFormat(DataType, PluginFormat) const noexcept final
{
return false;
@@ -724,9 +824,9 @@ class IPluginV2IOExt : public IPluginV2Ext
//!
//! \enum PluginFieldType
+//!
//! \brief The possible field types for custom layer.
//!
-
enum class PluginFieldType : int32_t
{
//! FP16 field type.
@@ -746,7 +846,13 @@ enum class PluginFieldType : int32_t
//! nvinfer1::Dims field type.
kDIMS = 7,
//! Unknown field type.
- kUNKNOWN = 8
+ kUNKNOWN = 8,
+ //! BF16 field type.
+ kBF16 = 9,
+ //! INT64 field type.
+ kINT64 = 10,
+ //! FP8 field type.
+ kFP8 = 11,
};
//!
@@ -759,22 +865,13 @@ enum class PluginFieldType : int32_t
class PluginField
{
public:
- //!
- //! \brief Plugin field attribute name
- //!
+ //! Plugin field attribute name
AsciiChar const* name;
- //!
- //! \brief Plugin field attribute data
- //!
+ //! Plugin field attribute data
void const* data;
- //!
- //! \brief Plugin field attribute type
- //! \see PluginFieldType
- //!
+ //! Plugin field attribute type
PluginFieldType type;
- //!
- //! \brief Number of data entries in the Plugin attribute
- //!
+ //! Number of data entries in the Plugin attribute
int32_t length;
PluginField(AsciiChar const* const name_ = nullptr, void const* const data_ = nullptr,
@@ -787,7 +884,11 @@ class PluginField
}
};
-//! Plugin field collection struct.
+//!
+//! \struct PluginFieldCollection
+//!
+//! \brief Plugin field collection struct.
+//!
struct PluginFieldCollection
{
//! Number of PluginField entries.
@@ -797,33 +898,56 @@ struct PluginFieldCollection
};
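For context, a hedged sketch of how an application might populate these structures before invoking a plugin creator; the attribute names ("pads", "alpha") are hypothetical, and the header path is assumed:

```cpp
#include <cstdint>
#include <vector>
#include "NvInferRuntimePlugin.h" // assumed location of PluginField/PluginFieldCollection

nvinfer1::PluginFieldCollection makeFieldCollection(std::vector<nvinfer1::PluginField>& storage)
{
    static int32_t const pads[] = {1, 1, 1, 1};
    static float const alpha = 0.2F;

    storage.clear();
    storage.emplace_back("pads", pads, nvinfer1::PluginFieldType::kINT32, 4);
    storage.emplace_back("alpha", &alpha, nvinfer1::PluginFieldType::kFLOAT32, 1);

    // The collection only references the PluginField array; `storage` must outlive it.
    nvinfer1::PluginFieldCollection fc{};
    fc.nbFields = static_cast<int32_t>(storage.size());
    fc.fields = storage.data();
    return fc;
}
```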
//!
-//! \class IPluginCreator
+//! \enum PluginCapabilityType
//!
-//! \brief Plugin creator class for user implemented layers.
+//! \brief Enumerates the different capability types an IPluginV3 object may have
//!
-//! \see IPlugin and IPluginFactory
+enum class PluginCapabilityType : int32_t
+{
+ //! Core capability. Every IPluginV3 object must have this.
+ kCORE = 0,
+ //! Build capability. IPluginV3 objects provided to the TensorRT build phase must have this.
+ kBUILD = 1,
+ //! Runtime capability. IPluginV3 objects provided to the TensorRT build and execution phases must have this.
+ kRUNTIME = 2
+};
+
+//!
+//! \enum TensorRTPhase
//!
+//! \brief Indicates a phase of operation of TensorRT
+//!
+enum class TensorRTPhase : int32_t
+{
+ //! Build phase of TensorRT
+ kBUILD = 0,
+ //! Execution phase of TensorRT
+ kRUNTIME = 1
+};
-class IPluginCreator
+namespace v_1_0
+{
+class IPluginCreatorInterface : public IVersionedInterface
{
public:
- //!
- //! \brief Return the version of the API the plugin creator was compiled with.
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes, the implementation provided here is safe to call from any thread.
- //!
- virtual int32_t getTensorRTVersion() const noexcept
- {
- return NV_TENSORRT_VERSION;
- }
+ ~IPluginCreatorInterface() noexcept override = default;
+
+protected:
+ IPluginCreatorInterface() = default;
+ IPluginCreatorInterface(IPluginCreatorInterface const&) = default;
+ IPluginCreatorInterface(IPluginCreatorInterface&&) = default;
+ IPluginCreatorInterface& operator=(IPluginCreatorInterface const&) & = default;
+ IPluginCreatorInterface& operator=(IPluginCreatorInterface&&) & = default;
+};
+class TRT_DEPRECATED IPluginCreator : public IPluginCreatorInterface
+{
+public:
//!
//! \brief Return the plugin name.
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -836,8 +960,8 @@ class IPluginCreator
//!
//! \brief Return the plugin version.
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -848,7 +972,8 @@ class IPluginCreator
virtual AsciiChar const* getPluginVersion() const noexcept = 0;
//!
- //! \brief Return a list of fields that needs to be passed to createPlugin.
+ //! \brief Return a list of fields that need to be passed to createPlugin.
+ //!
//! \see PluginFieldCollection
//!
//! \usage
@@ -862,6 +987,9 @@ class IPluginCreator
//!
//! \brief Return a plugin object. Return nullptr in case of error.
//!
+ //! \param name A NULL-terminated name string of 1024 bytes or less, including the NULL terminator.
+ //! \param fc A pointer to a collection of fields needed for constructing the plugin.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
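A hedged sketch of a createPlugin() override that reads a hypothetical "alpha" attribute from the field collection; `MyPlugin` and `MyPluginCreator` are illustrative names, not part of this header:

```cpp
#include <cstring>
#include <new>

nvinfer1::IPluginV2* MyPluginCreator::createPlugin(
    nvinfer1::AsciiChar const* name, nvinfer1::PluginFieldCollection const* fc) noexcept
{
    float alpha = 1.0F;
    for (int32_t i = 0; i < fc->nbFields; ++i)
    {
        if (std::strcmp(fc->fields[i].name, "alpha") == 0
            && fc->fields[i].type == nvinfer1::PluginFieldType::kFLOAT32)
        {
            alpha = *static_cast<float const*>(fc->fields[i].data);
        }
    }
    // Per the contract above, return nullptr in case of error.
    return new (std::nothrow) MyPlugin(name, alpha);
}
```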
@@ -873,6 +1001,12 @@ class IPluginCreator
//!
//! \brief Called during deserialization of plugin layer. Return a plugin object.
//!
+ //! \param name A NULL-terminated name string of 1024 bytes or less, including the NULL terminator.
+ //! \param serialData The start address of a byte array with the serialized plugin representation.
+ //! \param serialLength The length in bytes of the byte array with the serialized plugin representation.
+ //!
+ //! \return A deserialized plugin object.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
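A hedged sketch of a deserializePlugin() override that mirrors the serialize() layout of a hypothetical `MyPlugin` (the member layout and class names are illustrative only):

```cpp
#include <cstring>
#include <new>

nvinfer1::IPluginV2* MyPluginCreator::deserializePlugin(
    nvinfer1::AsciiChar const* name, void const* serialData, size_t serialLength) noexcept
{
    int32_t channels{0};
    float alpha{1.0F};
    if (serialLength < sizeof(channels) + sizeof(alpha))
    {
        return nullptr; // malformed serialization
    }
    auto const* d = static_cast<char const*>(serialData);
    std::memcpy(&channels, d, sizeof(channels));
    d += sizeof(channels);
    std::memcpy(&alpha, d, sizeof(alpha));
    return new (std::nothrow) MyPlugin(name, channels, alpha);
}
```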
@@ -886,6 +1020,8 @@ class IPluginCreator
//! \brief Set the namespace of the plugin creator based on the plugin
//! library it belongs to. This can be set while registering the plugin creator.
//!
+ //! \param pluginNamespace A NULL-terminated namespace string of 1024 bytes or less, including the NULL terminator.
+ //!
//! \see IPluginRegistry::registerCreator()
//!
//! \usage
@@ -899,8 +1035,8 @@ class IPluginCreator
//!
//! \brief Return the namespace of the plugin creator object.
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -911,16 +1047,46 @@ class IPluginCreator
virtual AsciiChar const* getPluginNamespace() const noexcept = 0;
IPluginCreator() = default;
- virtual ~IPluginCreator() = default;
+ ~IPluginCreator() override = default;
protected:
-// @cond SuppressDoxyWarnings
+ // @cond SuppressDoxyWarnings
IPluginCreator(IPluginCreator const&) = default;
IPluginCreator(IPluginCreator&&) = default;
IPluginCreator& operator=(IPluginCreator const&) & = default;
IPluginCreator& operator=(IPluginCreator&&) & = default;
// @endcond
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN CREATOR_V1", 1, 0};
+ }
};
+} // namespace v_1_0
+
+//!
+//! \class IPluginCreatorInterface
+//!
+//! \brief Base class for all plugin creator versions.
+//!
+//! \see IPluginCreator and IPluginRegistry
+//!
+using IPluginCreatorInterface = v_1_0::IPluginCreatorInterface;
+
+//!
+//! \class IPluginCreator
+//!
+//! \brief Plugin creator class for user implemented layers.
+//!
+//! \see IPlugin and IPluginFactory
+//!
+//! \deprecated Deprecated in TensorRT 10.0. Please implement IPluginCreatorV3One along with IPluginV3 plugins
+//! instead.
+//!
+using IPluginCreator = v_1_0::IPluginCreator;
} // namespace nvinfer1
diff --git a/include/NvInferSafeRuntime.h b/include/NvInferSafeRuntime.h
index fbc5a6af..1c322c4e 100644
--- a/include/NvInferSafeRuntime.h
+++ b/include/NvInferSafeRuntime.h
@@ -61,14 +61,18 @@ class IRuntime
{
public:
//!
- //! \brief Deserialize an engine from a stream.
+ //! \brief Deserialize an engine from a byte array.
//!
//! If the serialized engine requires plugins the plugin creator must be registered by calling
- //! IPluginRegistry::registerCreator() before calling deserializeCudaEngine(). Every plugin creator
- //! registered must have a unique combination of namespace, plugin name, and version.
+ //! IPluginRegistry::registerCreator() before calling deserializeCudaEngine().
//!
- //! \param blob The memory that holds the serialized engine.
- //! \param size The size of the memory in bytes.
+ //! \param blob The memory that holds the serialized engine. The content must be a copy of
+ //! the result of calling IHostMemory::data() on a serialized plan that was created via calling
+ //! IBuilder::buildSerializedNetwork() on a network within the supported safety scope.
+ //! Additionally, it must have been validated via IConsistencyChecker::validate().
+ //!
+ //! \param size The size of the memory in bytes. This must be the result of calling IHostMemory::size()
+ //! on the same IHostMemory object that is associated with the blob parameter.
//!
//! \return The engine, or nullptr if it could not be deserialized.
//!
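As a hedged end-to-end sketch: the factory function nvinfer1::safe::createInferRuntime() and the `logger`/`plan` objects are assumed to be available and to satisfy the constraints described in the parameters above.

```cpp
nvinfer1::safe::ICudaEngine* deserializeSafeEngine(nvinfer1::ILogger& logger, nvinfer1::IHostMemory& plan)
{
    nvinfer1::safe::IRuntime* runtime = nvinfer1::safe::createInferRuntime(logger);
    if (runtime == nullptr)
    {
        return nullptr;
    }
    // blob and size must come from the same IHostMemory object, as documented above.
    return runtime->deserializeCudaEngine(plan.data(), plan.size());
}
```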
@@ -83,12 +87,9 @@ class IRuntime
//!
//! \brief Set the GPU allocator.
- //! \param allocator Set the GPU allocator to be used by the runtime. All GPU memory acquired will use this
- //! allocator. If NULL is passed, the default allocator will be used.
- //!
- //! Default: uses cudaMalloc/cudaFree.
//!
- //! If nullptr is passed, the default allocator will be used.
+ //! \param allocator The GPU allocator to be used by the runtime. All GPU memory acquired will use this
+ //! allocator. If nullptr is passed, the default allocator will be used, which calls cudaMalloc and cudaFree.
//!
//! \usage
//! - Allowed context for the API call
@@ -100,12 +101,13 @@ class IRuntime
//! \brief Set the ErrorRecorder for this interface.
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
+ //! This function will call incRefCount of the registered ErrorRecorder at least once. If the recorder is set to
+ //! nullptr, an error code of ErrorCode::kINVALID_ARGUMENT will be emitted if the recorder has already been
+ //! registered, or ILogger::Severity::kERROR will be logged if the recorder has not yet been registered.
+ //!
+ //! \param recorder The error recorder to register with this interface, or nullptr to deregister the current
+ //! error recorder.
//!
- //! \param recorder The error recorder to register with this interface.
- //
//! \see getErrorRecorder()
//!
//! \usage
@@ -118,9 +120,10 @@ class IRuntime
//! \brief Get the ErrorRecorder assigned to this interface.
//!
//! Retrieves the assigned error recorder object for the given class. A default error recorder does not exist,
- //! so a nullptr will be returned if setErrorRecorder has not been called.
+ //! so a nullptr will be returned if setErrorRecorder has not been called or a previously assigned error recorder
+ //! has been deregistered.
//!
- //! \return A pointer to the IErrorRecorder object that has been registered.
+ //! \return A pointer to the IErrorRecorder object that has been registered, or nullptr if no error recorder is set.
//!
//! \see setErrorRecorder()
//!
@@ -148,119 +151,20 @@ class IRuntime
class ICudaEngine
{
public:
- //!
- //! \brief Get the number of binding indices.
- //!
- //! \return The number of binding indices.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getNbIOTensors.
- //!
- //! \see getBindingIndex()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual std::int32_t getNbBindings() const noexcept = 0;
-
- //!
- //! \brief Retrieve the binding index for a named tensor.
- //!
- //! safe::IExecutionContext::enqueueV2() requires an array of buffers.
- //! Engine bindings map from tensor names to indices in this array.
- //! Binding indices are assigned at engine build time, and take values in the range [0 ... n-1] where n is the total
- //! number of inputs and outputs.
- //!
- //! \warning Strings passed to the runtime must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
- //!
- //! \param name The tensor name.
- //! \return The binding index for the named tensor, or -1 if the name is not found.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by name-based methods. Use them instead of binding-index
- //! based methods.
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual std::int32_t getBindingIndex(AsciiChar const* const name) const noexcept = 0;
-
- //!
- //! \brief Retrieve the name corresponding to a binding index.
- //!
- //! This is the reverse mapping to that provided by getBindingIndex().
- //!
- //! \param bindingIndex The binding index.
- //! \return The name corresponding to the index, or nullptr if the index is out of range.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by name-based methods. Use them instead of binding-index
- //! based methods.
- //!
- //! \see getBindingIndex()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual AsciiChar const* getBindingName(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Determine whether a binding is an input binding.
- //!
- //! \param bindingIndex The binding index.
- //! \return True if the index corresponds to an input binding and the index is in range.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by tensorIOMode().
- //!
- //! \see safe::ICudaEngine::tensorIOMode()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual bool bindingIsInput(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Get the dimensions of a binding.
- //!
- //! \param bindingIndex The binding index.
- //! \return The dimensions of the binding if the index is in range, otherwise Dims()
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorShape().
- //!
- //! \see safe::ICudaEngine::getTensorShape()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual Dims getBindingDimensions(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Determine the required data type for a buffer from its binding index.
- //!
- //! \param bindingIndex The binding index.
- //! \return The type of the data in the buffer.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorDataType().
- //!
- //! \see safe::ICudaEngine::getTensorDataType()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual DataType getBindingDataType(std::int32_t const bindingIndex) const noexcept = 0;
-
//!
//! \brief Create an execution context.
//!
//! \see safe::IExecutionContext.
//!
+ //! \return An execution context object if it can be constructed, or nullptr if the construction fails.
+ //!
+ //! \details Reasons for failure may include, but are not limited to:
+ //! - Heap memory exhaustion
+ //! - Device memory exhaustion
+ //!
//! \usage
//! - Allowed context for the API call
- //! - Thread-safe: Yes; if createExecutionContext fails, users should treat this as a critical
+ //! - Thread-safe: Yes; if createExecutionContext fails, users must treat this as a critical
//! error and not perform any subsequent TensorRT operations apart from outputting
//! the error logs.
//!
@@ -269,13 +173,18 @@ class ICudaEngine
//!
//! \brief Create an execution context without any device memory allocated.
//!
- //! The memory for execution of this device context must be supplied by the application.
+ //! The memory for execution of this device context must be supplied by the application by calling
+ //! safe::IExecutionContext::setDeviceMemory().
//!
//! \see getDeviceMemorySize() safe::IExecutionContext::setDeviceMemory()
//!
+ //! \return An execution context object if it can be constructed, or nullptr if the construction fails.
+ //!
+ //! \details Reasons for failure may include, but are not limited to, heap memory exhaustion.
+ //!
//! \usage
//! - Allowed context for the API call
- //! - Thread-safe: Yes; if createExecutionContext fails, users should treat this as a critical
+ //! - Thread-safe: Yes; if createExecutionContext fails, users must treat this as a critical
//! error and not perform any subsequent TensorRT operations apart from outputting
//! the error logs.
//!
@@ -286,77 +195,15 @@ class ICudaEngine
//!
//! \see safe::IExecutionContext::setDeviceMemory()
//!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- virtual size_t getDeviceMemorySize() const noexcept = 0;
-
- //!
- //! \brief Return the number of bytes per component of an element.
- //!
- //! The vector component size is returned if getBindingVectorizedDim() != -1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorBytesPerComponent().
- //!
- //! \see safe::ICudaEngine::getTensorBytesPerComponent()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual std::int32_t getBindingBytesPerComponent(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Return the number of components included in one element.
- //!
- //! The number of elements in the vectors is returned if getBindingVectorizedDim() != -1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorComponentsPerElement().
- //!
- //! \see safe::ICudaEngine::getTensorComponentsPerElement()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual std::int32_t getBindingComponentsPerElement(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Return the binding format.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorFormat().
- //!
- //! \see safe::ICudaEngine::getTensorFormat()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual TensorFormat getBindingFormat(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Return the dimension index that the buffer is vectorized.
- //!
- //! Specifically -1 is returned if scalars per vector is 1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorVectorizedDim().
- //!
- //! \see safe::ICudaEngine::getTensorVectorizedDim()
+ //! \return Size of a contiguous memory buffer (in bytes) that users need to provide to
+ //! safe::IExecutionContext::setDeviceMemory() if the execution context has been created by calling
+ //! createExecutionContextWithoutDeviceMemory().
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- TRT_DEPRECATED virtual std::int32_t getBindingVectorizedDim(std::int32_t const bindingIndex) const noexcept = 0;
+ virtual size_t getDeviceMemorySize() const noexcept = 0;
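A hedged usage sketch combining the two calls; error handling is reduced to early returns, and the alignment provided by cudaMalloc is assumed to be sufficient:

```cpp
#include <cuda_runtime_api.h>

nvinfer1::safe::IExecutionContext* makeContextWithUserMemory(
    nvinfer1::safe::ICudaEngine& engine, void** scratch)
{
    nvinfer1::safe::IExecutionContext* context = engine.createExecutionContextWithoutDeviceMemory();
    if (context == nullptr)
    {
        return nullptr;
    }
    // Provide a device buffer of at least getDeviceMemorySize() bytes; it must stay
    // valid and untouched while work enqueued on this context is executing.
    if (cudaMalloc(scratch, engine.getDeviceMemorySize()) != cudaSuccess)
    {
        return nullptr;
    }
    context->setDeviceMemory(*scratch);
    return context;
}
```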
//!
//! \brief Returns the name of the network associated with the engine.
@@ -366,7 +213,8 @@ class ICudaEngine
//!
//! \see INetworkDefinition::setName(), INetworkDefinition::getName()
//!
- //! \return A null-terminated C-style string representing the name of the network.
+ //! \return A NULL-terminated C-style string representing the name of the network, which will have a length of
+ //! 1024 bytes or less including the NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -378,12 +226,12 @@ class ICudaEngine
//! \brief Set the ErrorRecorder for this interface.
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
+ //! This function will call incRefCount of the registered ErrorRecorder at least once. If the recorder is set to
+ //! nullptr, the error code ErrorCode::kINVALID_ARGUMENT will be emitted if a recorder has already been registered.
+ //!
+ //! \param recorder The error recorder to register with this interface, or nullptr to deregister the current
+ //! error recorder.
//!
- //! \param recorder The error recorder to register with this interface.
- //
//! \see getErrorRecorder()
//!
//! \usage
@@ -399,7 +247,8 @@ class ICudaEngine
//! nullptr will be returned if an error reporter has not been inherited
//! from the IRuntime, and setErrorReporter() has not been called.
//!
- //! \return A pointer to the IErrorRecorder object that has been registered.
+ //! \return A pointer to the IErrorRecorder object that has been registered, or nullptr if none has been
+ //! registered.
//!
//! \see setErrorRecorder()
//!
@@ -417,71 +266,77 @@ class ICudaEngine
ICudaEngine& operator=(ICudaEngine&&) & = delete;
//!
- //! \brief Get extent of an input or output tensor.
+ //! \brief Get the extent of an input or output tensor.
+ //!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return Extent of the tensor. Dims{-1, {}} will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit.
+ //! \return Extent of the tensor. The invalid value Dims{-1, {}} will be returned if
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual Dims getTensorShape(AsciiChar const* tensorName) const noexcept = 0;
+ virtual Dims getTensorShape(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Determine the required data type for a buffer from its tensor name.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return The type of the data in the buffer. DataType::kFLOAT will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit.
+ //! \return The type of the data in the buffer. The default value DataType::kFLOAT will be returned if
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual DataType getTensorDataType(AsciiChar const* tensorName) const noexcept = 0;
+ virtual DataType getTensorDataType(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Determine whether a tensor is an input or output tensor.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return kINPUT if tensorName is an input, kOUTPUT if tensorName is an output, or kNONE if neither.
+ //! \return kINPUT if tensorName is the name of an input tensor, kOUTPUT if tensorName is the name of an output
+ //! tensor. The invalid value kNONE is returned if
+ //! - tensorName exceeds the string length limit, or
+ //! - tensorName is nullptr, or
+ //! - tensorName does not correspond to any input or output tensor.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual TensorIOMode getTensorIOMode(AsciiChar const* tensorName) const noexcept = 0;
+ virtual TensorIOMode getTensorIOMode(AsciiChar const* const tensorName) const noexcept = 0;
//!
- //! \brief Return the number of bytes per component of an element.
+ //! \brief Return the size of the tensor data type in bytes for a vectorized tensor.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return The vector component size. 0 will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the tensor of given name is not vectorized.
+ //! \return The size of the tensor data type in bytes if the tensor is vectorized (4 for float and int32,
+ //! 2 for half, 1 for int8). 0 will be returned if
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - the tensor of the given name is not vectorized.
//!
//! \see safe::ICudaEngine::getTensorVectorizedDim()
//!
@@ -489,21 +344,21 @@ class ICudaEngine
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual std::int32_t getTensorBytesPerComponent(AsciiChar const* tensorName) const noexcept = 0;
+ virtual std::int32_t getTensorBytesPerComponent(AsciiChar const* const tensorName) const noexcept = 0;
//!
- //! \brief Return the number of components included in one element.
+ //! \brief Return the number of components included in one element for a vectorized tensor.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return The vector component size. -1 will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the tensor of given name is not vectorized.
+ //! \return The vector length (in scalars) for a vectorized tensor, or 1 for a scalar tensor.
+ //! The invalid value -1 will be returned if
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \see safe::ICudaEngine::getTensorVectorizedDim()
//!
@@ -511,48 +366,48 @@ class ICudaEngine
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual std::int32_t getTensorComponentsPerElement(AsciiChar const* tensorName) const noexcept = 0;
+ virtual std::int32_t getTensorComponentsPerElement(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Return the tensor format.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \return The tensor format. TensorFormat::kLINEAR will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit.
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual TensorFormat getTensorFormat(AsciiChar const* tensorName) const noexcept = 0;
+ virtual TensorFormat getTensorFormat(AsciiChar const* const tensorName) const noexcept = 0;
//!
- //! \brief Return the dimension index along which buffer is vectorized.
+ //! \brief Return the dimension index along which the buffer is vectorized.
//!
- //! Specifically -1 is returned if scalars per vector is 1.
+ //! Specifically, -1 is returned if the tensor is not vectorized, i.e. the number of scalars per vector is 1.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \return The dimension index along which the buffer is vectorized. -1 will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the tensor of given name is not vectorized.
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit (1024 bytes including the NULL terminator), or
+ //! - the tensor of given name is not vectorized.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual std::int32_t getTensorVectorizedDim(AsciiChar const* tensorName) const noexcept = 0;
+ virtual std::int32_t getTensorVectorizedDim(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Return the number of input and output tensors for the network from which the engine was built.
@@ -570,13 +425,14 @@ class ICudaEngine
//!
//! \brief Return the name of an IO tensor.
//!
- //! If the index does not fall between 0 and getNbIOTensors()-1, the function will fail with an error code of ErrorCode::kINVALID_ARGUMENT(3) that is
- //! emitted to the registered IErrorRecorder.
+ //! If the index does not fall between 0 and getNbIOTensors()-1, the function will fail with an error code
+ //! of ErrorCode::kINVALID_ARGUMENT(3) that is emitted to the registered IErrorRecorder.
//!
- //! \param index The value that falls between 0 and getNbIOTensors()-1.
+ //! \param index The IO tensor index.
//!
- //! \return The name of an IO tensor. nullptr will be returned if the index does not fall between 0 and
- //! getNbIOTensors()-1.
+ //! \return The name of an IO tensor, which will be a NULL-terminated string of 1024 bytes or less (including the
+ //! NULL terminator) if the index is in the range (between 0 and getNbIOTensors()-1). nullptr will be returned if
+ //! the index is not in range.
//!
//! \see getNbIOTensors()
//!
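A hedged sketch that enumerates the engine's IO tensors by combining getNbIOTensors(), getIOTensorName(), and getTensorIOMode():

```cpp
void listIOTensors(nvinfer1::safe::ICudaEngine const& engine)
{
    for (int32_t i = 0; i < engine.getNbIOTensors(); ++i)
    {
        // The name is non-null here because i is in the valid range.
        nvinfer1::AsciiChar const* name = engine.getIOTensorName(i);
        bool const isInput = (engine.getTensorIOMode(name) == nvinfer1::TensorIOMode::kINPUT);
        // ... record `name` and `isInput`, e.g. to bind buffers later ...
        (void) isInput;
    }
}
```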
@@ -588,17 +444,35 @@ class ICudaEngine
};
//!
-//! \brief Space to record information about floating point runtime errors
+//! \brief Space to record information about runtime errors.
+//!
+//! kNAN_CONSUMED errors occur when NAN values are stored in an INT8 quantized datatype.
+//! kINF_CONSUMED errors occur when +-INF values are stored in an INT8 quantized datatype.
+//! kGATHER_OOB errors occur when a gather index tensor contains a value that is outside the bounds of the data tensor.
+//! kSCATTER_OOB and kSCATTER_RACE are reserved for future use.
+//!
+//! Records the RuntimeErrorType values that occur during asynchronous kernel execution.
+struct RuntimeErrorInformation
+{
+ //! Each bit represents a RuntimeErrorType that has occurred during kernel execution.
+ uint64_t bitMask;
+};
+
//!
-//! NAN errors occur when NAN values are stored in an INT8 quantized datatype.
-//! INF errors occur when +-INF values are stored in an INT8 quantized datatype.
+//! \brief Enum to represent runtime error types.
//!
-struct FloatingPointErrorInformation
+enum class RuntimeErrorType : uint64_t
{
- //! Total count of errors relating to NAN values (0 if none)
- int32_t nbNanErrors;
- //! Total count of errors relating to INF values (0 if none)
- int32_t nbInfErrors;
+ //! NaN floating-point value was silently consumed
+ kNAN_CONSUMED = 1ULL << 0,
+ //! Inf floating-point value was silently consumed
+ kINF_CONSUMED = 1ULL << 1,
+ //! Out-of-bounds access in gather operation
+ kGATHER_OOB = 1ULL << 2,
+ //! Out-of-bounds access in scatter operation
+ kSCATTER_OOB = 1ULL << 3,
+ //! Race condition in scatter operation
+ kSCATTER_RACE = 1ULL << 4,
};
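A hedged sketch of checking the bitmask on the host after the stream used for inference has been synchronized; `devErrInfo` is assumed to be the device buffer registered via setErrorBuffer(), and the safe namespace is assumed to match the surrounding declarations:

```cpp
#include <cstdint>
#include <cuda_runtime_api.h>

bool consumedNanOrInf(nvinfer1::safe::RuntimeErrorInformation const* devErrInfo)
{
    // Copy the device-side error information to the host for inspection.
    nvinfer1::safe::RuntimeErrorInformation hostInfo{};
    cudaMemcpy(&hostInfo, devErrInfo, sizeof(hostInfo), cudaMemcpyDeviceToHost);

    uint64_t const mask = static_cast<uint64_t>(nvinfer1::safe::RuntimeErrorType::kNAN_CONSUMED)
        | static_cast<uint64_t>(nvinfer1::safe::RuntimeErrorType::kINF_CONSUMED);
    return (hostInfo.bitMask & mask) != 0ULL;
}
```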
//!
@@ -633,8 +507,9 @@ class IExecutionContext
//!
//! This method copies the name string.
//!
- //! \warning Strings passed to the runtime must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning Strings passed to the runtime must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator. Otherwise, the operation will not change the execution context name, and
+ //! an error message will be recorded via the error recorder.
//!
//! \see getName()
//!
@@ -647,6 +522,9 @@ class IExecutionContext
//!
//! \brief Return the name of the execution context.
//!
+ //! \return The name that was passed to setName(), as a NULL-terminated string of 1024 bytes or less including
+ //! the NULL terminator. An empty string will be returned as the default value.
+ //!
//! \see setName()
//!
//! \usage
@@ -658,12 +536,18 @@ class IExecutionContext
//!
//! \brief Set the device memory for use by this execution context.
//!
+ //! \param memory The start address of a device memory buffer whose size in bytes must be at least the value
+ //! returned by getEngine().getDeviceMemorySize().
+ //!
//! If using enqueueV2() to run the network, The memory is in use
//! from the invocation of enqueueV2() until network execution is complete.
//! Releasing or otherwise using the memory for other purposes during this time will result in undefined behavior.
//!
//! \warning Do not release or use for other purposes the memory set here during network execution.
//!
+ //! \warning If the execution context has been created by calling createExecutionContext(), this
+ //! function must not be used and will fail with an error message if called.
+ //!
//! \see safe::ICudaEngine::getDeviceMemorySize() safe::ICudaEngine::createExecutionContextWithoutDeviceMemory()
//!
//! \usage
@@ -672,31 +556,17 @@ class IExecutionContext
//!
virtual void setDeviceMemory(void* const memory) noexcept = 0;
- //!
- //! \brief Return the strides of the buffer for the given binding.
- //!
- //! \param bindingIndex The binding index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorStrides().
- //!
- //! \see safe::IExecutionContext::getTensorStrides()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual Dims getStrides(std::int32_t const bindingIndex) const noexcept = 0;
-
//!
//! \brief Set the ErrorRecorder for this interface.
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
+ //! This function will call incRefCount of the registered ErrorRecorder at least once. If the recorder is set to
+ //! nullptr, the error code ErrorCode::kINVALID_ARGUMENT will be emitted if a recorder has already been registered. The
+ //! lifetime of the error recorder object must exceed the lifetime of the execution context.
+ //!
+ //! \param recorder Either a pointer to a valid error recorder object to register with this interface,
+ //! or nullptr to deregister the current recorder.
//!
- //! \param recorder The error recorder to register with this interface.
- //
//! \see getErrorRecorder()
//!
//! \usage
@@ -711,40 +581,17 @@ class IExecutionContext
//! Retrieves the assigned error recorder object for the given class. A default error recorder does not exist,
//! so a nullptr will be returned if setErrorRecorder has not been called.
//!
- //! \return A pointer to the IErrorRecorder object that has been registered.
+ //! \return A pointer to the IErrorRecorder object that has been registered, or nullptr if the error recorder
+ //! has been deregistered or not set.
//!
//! \see setErrorRecorder()
//!
//! \usage
//! - Allowed context for the API call
- //! - Thread-safe: No
+ //! - Thread-safe: Yes
//!
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
- //!
- //! \brief Enqueue inference of a batch on a stream.
- //!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices can be
- //! queried using safe::ICudaEngine::getBindingIndex().
- //! This method only works for an execution context built from a network without an implicit batch dimension.
- //! \param bindings An array of pointers to input and output buffers for the network.
- //! \param stream A cuda stream on which the inference kernels will be enqueued.
- //! \param inputConsumed An optional event which will be signaled when the input buffers can be refilled with new
- //! data.
- //!
- //! \return True if the kernels were enqueued successfully.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by enqueueV3().
- //!
- //! \see safe::IExecutionContext::enqueueV3()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: No
- //!
- TRT_DEPRECATED virtual bool enqueueV2(
- void* const* const bindings, cudaStream_t const stream, cudaEvent_t const* const inputConsumed) noexcept = 0;
-
IExecutionContext() = default;
virtual ~IExecutionContext() noexcept = default;
IExecutionContext(IExecutionContext const&) = delete;
@@ -753,17 +600,18 @@ class IExecutionContext
IExecutionContext& operator=(IExecutionContext&&) & = delete;
//!
- //! \brief Set error buffer output for floating point errors.
+ //! \brief Set error buffer output for runtime errors.
//!
//! The error buffer output must be allocated in device memory and will be used for subsequent
- //! calls to enqueueV2. Checking the contents of the error buffer after inference is the responsibility
- //! of the application. The pointer passed here must have alignment adequate for the FloatingPointErrorInformation
- //! struct.
+ //! calls to enqueueV2() or enqueueV3(). Checking the contents of the error buffer after inference is the
+ //! responsibility of the application. The pointer passed here must have alignment adequate for the
+ //! RuntimeErrorInformation struct.
//!
- //! \warning Do not release or use the contents of the error buffer for any other purpose before synchronizing
- //! on the CUDA stream passed to enqueueV2.
+ //! \warning The buffer is written if reportable errors are encountered during network execution. Releasing the
+ //! buffer before network execution is complete will result in undefined behavior. Accessing the memory before
+ //! network execution is complete may not correctly capture the error state.
//!
- //! \param buffer The device memory to use as floating point error buffer
+ //! \param buffer The device memory address of the runtime error information buffer.
//!
//! \see getErrorBuffer()
//!
@@ -771,12 +619,12 @@ class IExecutionContext
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual void setErrorBuffer(FloatingPointErrorInformation* const buffer) noexcept = 0;
+ virtual void setErrorBuffer(RuntimeErrorInformation* const buffer) noexcept = 0;
//!
- //! \brief Get error buffer output for floating point errors.
+ //! \brief Get error buffer output for runtime errors.
//!
- //! \return Pointer to device memory to use as floating point error buffer or nullptr if not set.
+ //! \return Pointer to device memory to use as runtime error buffer or nullptr if not set.
//!
//! \see setErrorBuffer()
//!
@@ -784,29 +632,30 @@ class IExecutionContext
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual FloatingPointErrorInformation* getErrorBuffer() const noexcept = 0;
+ virtual RuntimeErrorInformation* getErrorBuffer() const noexcept = 0;
//!
//! \brief Return the strides of the buffer for the given tensor name.
//!
//! The strides are in units of elements, not components or bytes.
+ //! Elements are vectors (for a vectorized format) or scalars (for a scalar format).
//! For example, for TensorFormat::kHWC8, a stride of one spans 8 scalars.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
//!
//! \return The strides of the buffer for the given tensor name. Dims{-1, {}} will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit.
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual Dims getTensorStrides(AsciiChar const* tensorName) const noexcept = 0;
+ virtual Dims getTensorStrides(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Set memory address for given input tensor.
@@ -816,23 +665,27 @@ class IExecutionContext
//! Before calling enqueueV3(), each input must have a non-null address.
//!
//! \param tensorName The name of an input tensor.
- //! \param data The pointer (void const*) to the const data owned by the user.
+ //! \param data The pointer (void const*) to the input tensor data, which is device memory owned by the user.
+ //! Users are responsible for ensuring that the buffer is at least as large as the expected size, which is
+ //! the product of the tensor dimensions (with the vectorized dimension padded to a multiple of the vector length)
+ //! and the data type size.
+ //!
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
- //! \warning The pointer must have at least 256-byte alignment.
+ //! \warning The data pointer must have 256-byte alignment.
//!
//! \return True on success, false if
- //! (1) name is not the name of an input tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) pointer to the const data is nullptr or not aligned.
+ //! - name is not the name of an input tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - pointer to the const data is nullptr or not correctly aligned.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual bool setInputTensorAddress(AsciiChar const* tensorName, void const* data) noexcept = 0;
+ virtual bool setInputTensorAddress(AsciiChar const* const tensorName, void const* const data) noexcept = 0;
//!
//! \brief Set memory address for given output tensor.
@@ -842,43 +695,48 @@ class IExecutionContext
//! Before calling enqueueV3(), each output must have a non-null address.
//!
//! \param tensorName The name of an output tensor.
- //! \param data The pointer (void*) to the data owned by the user.
+ //! \param data The pointer (void*) to the output tensor data, which is device memory owned by the user.
+ //! Users are responsible for ensuring that the buffer size is at least the expected length, which is
+ //! the product of the tensor dimensions (with the vectorized dimension padded to a multiple of the vector length)
+ //! times the data type size.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
- //! \warning The pointer must have at least 256-byte alignment.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
+ //! \warning The data pointer must have 256-byte alignment.
//!
//! \return True on success. Return false if
- //! (1) name is not the name of an output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) pointer to data is nullptr or not aligned.
+ //! - name is not the name of an output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - pointer to data is nullptr or not aligned.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual bool setOutputTensorAddress(AsciiChar const* tensorName, void* data) noexcept = 0;
+ virtual bool setOutputTensorAddress(AsciiChar const* const tensorName, void* const data) noexcept = 0;
//!
- //! \brief Mark input as consumed.
+ //! \brief Set the event to mark inputs as consumed.
//!
//! Passing event==nullptr removes whatever event was set, if any.
//!
- //! \param event The cuda event that is triggered after all input tensors have been consumed.
+ //! \param event The CUDA event that is signaled after all input tensors have been consumed, or nullptr to remove
+ //! an event that was previously set.
//!
- //! \return True on success, false if error occurred.
+ //! \return True on success, false if an error occurred.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual bool setInputConsumedEvent(cudaEvent_t event) noexcept = 0;
+ virtual bool setInputConsumedEvent(cudaEvent_t const event) noexcept = 0;
//!
//! \brief Return the event associated with consuming the input.
//!
- //! \return The cuda event, nullptr will be returned if the event is not set yet.
+ //! \return The CUDA event that was passed to setInputConsumedEvent(). nullptr will be returned if the event is
+ //! not set.
//!
//! \usage
//! - Allowed context for the API call
@@ -891,57 +749,61 @@ class IExecutionContext
//!
//! \param tensorName The name of an input tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
//!
- //! \return The memory address for the given input tensor. nullptr will be returned if
- //! (1) name is not the name of an input tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the memory address for the given input tensor is not set yet.
+ //! \return The device memory address for the given input tensor. nullptr will be returned if
+ //! - name is not the name of an input tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - the memory address for the given input tensor is not set.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual void const* getInputTensorAddress(AsciiChar const* tensorName) const noexcept = 0;
+ virtual void const* getInputTensorAddress(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Get memory address for given output tensor.
//!
//! \param tensorName The name of an output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
//!
- //! \return Raw output data pointer (void*) for given output tensor, return nullptr if
- //! (1) name is not the name of an output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the memory address for the given output tensor is not set yet.
+ //! \return The device memory address for the given output tensor. Return nullptr if
+ //! - name is not the name of an output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - the memory address for the given output tensor is not set.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual void* getOutputTensorAddress(AsciiChar const* tensorName) const noexcept = 0;
+ virtual void* getOutputTensorAddress(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Enqueue inference on a stream.
//!
//! Modifying or releasing memory that has been registered for the tensors before stream
- //! synchronization or the event passed to setInputConsumedEvent has been being triggered results in undefined
+ //! synchronization or the event passed to setInputConsumedEvent has been signaled results in undefined
//! behavior.
//!
- //! \param stream A cuda stream on which the inference kernels will be enqueued.
+ //! \param stream A CUDA stream on which the inference kernels will be enqueued.
//!
//! \return True on success, false if any execution error occurred.
+ //! Errors may include, but are not limited to:
+ //! - Internal errors while executing an engine layer
+ //! - CUDA errors
+ //! - Some input or output tensor addresses have not been set.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual bool enqueueV3(cudaStream_t stream) noexcept = 0;
+ virtual bool enqueueV3(cudaStream_t const stream) noexcept = 0;
};
//!
@@ -952,7 +814,7 @@ class IExecutionContext
//! Internally, the plugin registry is considered to be a singleton so all
//! plugins in an application are part of the same global registry.
//! Note that the plugin registry is only supported for plugins of type
-//! IPluginV2 and should also have a corresponding IPluginCreator implementation.
+//! IPluginV2 and must also have a corresponding IPluginCreator implementation.
//!
//! \see IPluginV2 and IPluginCreator
//!
@@ -966,11 +828,23 @@ class IPluginRegistry
{
public:
//!
- //! \brief Register a plugin creator. Returns false if one with same type
- //! is already registered.
+ //! \brief Register a plugin creator.
//!
- //! \warning The string pluginNamespace must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \param creator The plugin creator to be registered.
+ //!
+ //! \param pluginNamespace A NULL-terminated namespace string, which must be 1024 bytes or less including the NULL
+ //! terminator. It must be identical to the result of calling
+ //! IPluginCreator::getPluginNamespace() on the creator object.
+ //!
+ //! \return True if the registration succeeded, else false.
+ //!
+ //! \details Registration may fail for any of the following reasons:
+ //! - The pluginNamespace string is nullptr.
+ //! - The pluginNamespace string exceeds the maximum length.
+ //! - The pluginNamespace string does not match the result of creator.getPluginNamespace().
+ //! - There have already been 100 plugin creators registered (maximum number of plugins exceeded).
+ //! - Another plugin creator with the same combination of plugin name, version and namespace has already been
+ //! registered.
//!
//! \usage
//! - Allowed context for the API call
@@ -980,7 +854,12 @@ class IPluginRegistry
//!
//! \brief Return all the registered plugin creators and the number of
- //! registered plugin creators. Returns nullptr if none found.
+ //! registered plugin creators. Returns nullptr if none is found.
+ //!
+ //! \param[out] numCreators If the call completes successfully, the number of registered plugin creators (which
+ //! will be an integer between 0 and 100 inclusive).
+ //! \return The start address of an IPluginCreator* array of length numCreators if at least one plugin creator
+ //! has been registered, or nullptr if there are no registered plugin creators.
//!
//! \usage
//! - Allowed context for the API call
@@ -992,27 +871,37 @@ class IPluginRegistry
//! \brief Return plugin creator based on plugin name, version, and
//! namespace associated with plugin during network creation.
//!
- //! \warning The strings pluginName, pluginVersion, and pluginNamespace must be 1024 bytes or less including the
- //! NULL terminator and must be NULL terminated.
+ //! \warning The strings pluginName, pluginVersion, and pluginNamespace must be NULL terminated and have a length
+ //! of 1024 bytes or less including the NULL terminator.
+ //!
+ //! \param pluginName The plugin name string
+ //! \param pluginVersion The plugin version string
+ //! \param pluginNamespace The plugin namespace (by default empty string)
+ //!
+ //! \return If a plugin creator corresponding to the passed name, version and namespace can be found in the
+ //! registry, it is returned. nullptr is returned in the following situations:
+ //! - Any of the input arguments is nullptr.
+ //! - Any of the input arguments exceeds the string length limit.
+ //! - No plugin creator corresponding to the input arguments can be found in the registry.
+ //! - A plugin creator can be found, but its stored namespace attribute does not match the pluginNamespace.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
virtual IPluginCreator* getPluginCreator(AsciiChar const* const pluginName, AsciiChar const* const pluginVersion,
- AsciiChar const* const pluginNamespace = "") noexcept
- = 0;
+ AsciiChar const* const pluginNamespace = "") noexcept = 0;
//!
//! \brief Set the ErrorRecorder for this interface
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
+ //! This function will call incRefCount of the registered ErrorRecorder at least once. If the recorder is set to
+ //! nullptr, the error code ErrorCode::kINVALID_ARGUMENT will be emitted if the recorder has been registered.
+ //!
+ //! \param recorder The error recorder to register with this interface, or nullptr to deregister the current
+ //! recorder.
//!
- //! \param recorder The error recorder to register with this interface.
- //
//! \see getErrorRecorder()
//!
//! \usage
@@ -1028,7 +917,9 @@ class IPluginRegistry
//! so a nullptr will be returned if setErrorRecorder has not been called, or an ErrorRecorder has not been
//! inherited.
//!
- //! \return A pointer to the IErrorRecorder object that has been registered.
+ //! \return A pointer to the IErrorRecorder object that has been registered, or nullptr if:
+ //! - no error recorder has been set, or
+ //! - the last error recorder has been deregistered via setErrorRecorder(nullptr).
//!
//! \see setErrorRecorder()
//!
@@ -1045,6 +936,8 @@ class IPluginRegistry
//! this function provides a mechanism for removing plugin creators registered in TensorRT.
//! The plugin creator that is specified by \p creator is removed from TensorRT and no longer tracked.
//!
+ //! \param creator The plugin creator to deregister.
+ //!
//! \return True if the plugin creator was deregistered, false if it was not found in the registry or otherwise
//! could not be deregistered.
@@ -1068,7 +961,12 @@ class IPluginRegistry
};
//!
-//! \brief Create an instance of an safe::IRuntime class.
+//! \brief Create an instance of a safe::IRuntime class.
+//!
+//! \param logger A logger object whose lifetime must exceed that of the returned runtime.
+//! Loggers must be thread-safe.
+//!
+//! \return A safe runtime object that can be used for safe plan file deserialization.
//!
//! This class is the logging class for the runtime.
//!
@@ -1093,8 +991,8 @@ extern "C" TENSORRTAPI IPluginRegistry* getSafePluginRegistry() noexcept;
//! loaded. This static object will register all creators available in the
//! library to the registry.
//!
-//! \warning Statically registering plugins should be avoided in the automotive
-//! safety context as the application developer should first register an error recorder
+//! \warning Statically registering plugins must be avoided in the automotive
+//! safety context as the application developer must first register an error recorder
//! with the plugin registry via IPluginRegistry::setErrorRecorder() before using
//! IPluginRegistry::registerCreator() or other methods.
//!
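A minimal usage sketch of the enqueueV3 workflow documented above, for readers tracking these interface changes. This is illustrative only: the function name, tensor names, buffers, event, and stream are placeholders, and it assumes the safe-runtime header and nvinfer1::safe namespace plus 256-byte-aligned device buffers sized to the padded tensor dimensions.

    #include "NvInferSafeRuntime.h" // assumed header for the safe-runtime interfaces above
    #include <cuda_runtime_api.h>

    // Sketch only: the call order follows the documentation above. All names are placeholders.
    bool runInference(nvinfer1::safe::IExecutionContext& context, void const* dInput, void* dOutput,
        cudaEvent_t inputConsumed, cudaStream_t stream)
    {
        bool ok = context.setInputTensorAddress("input", dInput);     // 256-byte-aligned device memory
        ok = ok && context.setOutputTensorAddress("output", dOutput); // 256-byte-aligned device memory
        ok = ok && context.setInputConsumedEvent(inputConsumed);      // signaled once all inputs are consumed
        ok = ok && context.enqueueV3(stream);                         // false indicates an execution error
        // Do not modify or release dInput until inputConsumed has been signaled (see the warning above).
        return ok;
    }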
diff --git a/include/NvInferVersion.h b/include/NvInferVersion.h
index b285fd02..8c99bea7 100644
--- a/include/NvInferVersion.h
+++ b/include/NvInferVersion.h
@@ -23,26 +23,19 @@
#ifndef NV_INFER_VERSION_H
#define NV_INFER_VERSION_H
-#define NV_TENSORRT_MAJOR 8 //!< TensorRT major version.
-#define NV_TENSORRT_MINOR 6 //!< TensorRT minor version.
-#define NV_TENSORRT_PATCH 1 //!< TensorRT patch version.
-#define NV_TENSORRT_BUILD 5 //!< TensorRT build number.
+#define NV_TENSORRT_MAJOR 10 //!< TensorRT major version.
+#define NV_TENSORRT_MINOR 0 //!< TensorRT minor version.
+#define NV_TENSORRT_PATCH 0 //!< TensorRT patch version.
+#define NV_TENSORRT_BUILD 6 //!< TensorRT build number.
#define NV_TENSORRT_LWS_MAJOR 0 //!< TensorRT LWS major version.
#define NV_TENSORRT_LWS_MINOR 0 //!< TensorRT LWS minor version.
#define NV_TENSORRT_LWS_PATCH 0 //!< TensorRT LWS patch version.
-// This #define is deprecated in TensorRT 8.6 and will be removed in 10.0. Use NV_TENSORRT_MAJOR.
-#define NV_TENSORRT_SONAME_MAJOR 8 //!< Shared object library major version number.
-// This #define is deprecated in TensorRT 8.6 and will be removed in 10.0. Use NV_TENSORRT_MINOR.
-#define NV_TENSORRT_SONAME_MINOR 6 //!< Shared object library minor version number.
-// This #define is deprecated in TensorRT 8.6 and will be removed in 10.0. Use NV_TENSORRT_PATCH.
-#define NV_TENSORRT_SONAME_PATCH 1 //!< Shared object library patch version number.
-
#define NV_TENSORRT_RELEASE_TYPE_EARLY_ACCESS 0 //!< An early access release
#define NV_TENSORRT_RELEASE_TYPE_RELEASE_CANDIDATE 1 //!< A release candidate
#define NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY 2 //!< A final release
-#define NV_TENSORRT_RELEASE_TYPE NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY //!< TensorRT release type
+#define NV_TENSORRT_RELEASE_TYPE NV_TENSORRT_RELEASE_TYPE_EARLY_ACCESS //!< TensorRT release type
#endif // NV_INFER_VERSION_H
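With the deprecated NV_TENSORRT_SONAME_* macros removed above, downstream code should key off the primary version macros instead. A small illustrative sketch, not taken from the repository:

    #include "NvInferVersion.h"

    // Gate code paths on the primary version macros; the SONAME macros no longer exist.
    #if NV_TENSORRT_MAJOR >= 10
    // TensorRT 10.x-specific path
    #else
    // fallback for older releases
    #endif

    static_assert(NV_TENSORRT_MAJOR >= 10, "this project assumes TensorRT 10.0 or newer");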
diff --git a/include/NvOnnxConfig.h b/include/NvOnnxConfig.h
index 28d8a690..8a222aa7 100644
--- a/include/NvOnnxConfig.h
+++ b/include/NvOnnxConfig.h
@@ -49,6 +49,7 @@ class IOnnxConfig
virtual ~IOnnxConfig() noexcept = default;
//!
//! \typedef Verbosity
+ //!
//! \brief Defines Verbosity level.
//!
typedef int32_t Verbosity;
@@ -188,15 +189,6 @@ class IOnnxConfig
//!
virtual void setPrintLayerInfo(bool) noexcept = 0;
- //!
- //! \brief Destroy IOnnxConfig object.
- //!
- //! \deprecated Use `delete` instead. Deprecated in TRT 8.0.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED virtual void destroy() noexcept = 0;
-
}; // class IOnnxConfig
TENSORRTAPI IOnnxConfig* createONNXConfig();
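Since IOnnxConfig::destroy() is removed above, objects returned by createONNXConfig() are now released with plain delete. A hedged sketch, assuming the usual nvonnxparser namespace for these declarations:

    #include "NvOnnxConfig.h"
    #include <memory>

    // Sketch only: with destroy() gone, delete (here via unique_ptr) releases the config.
    std::unique_ptr<nvonnxparser::IOnnxConfig> config{nvonnxparser::createONNXConfig()};
    // ... configure as needed, e.g. config->setPrintLayerInfo(true) from the header above ...
    // The config is deleted automatically when the unique_ptr goes out of scope.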
diff --git a/include/NvUffParser.h b/include/NvUffParser.h
deleted file mode 100644
index 468895c2..00000000
--- a/include/NvUffParser.h
+++ /dev/null
@@ -1,230 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef NV_UFF_PARSER_H
-#define NV_UFF_PARSER_H
-
-#include "NvInfer.h"
-
-//!
-//! \file NvUffParser.h
-//!
-//! This is the API for the UFF Parser
-//!
-
-// Current supported Universal Framework Format (UFF) version for the parser.
-#define UFF_REQUIRED_VERSION_MAJOR 0
-#define UFF_REQUIRED_VERSION_MINOR 6
-#define UFF_REQUIRED_VERSION_PATCH 9
-
-//!
-//! \namespace nvuffparser
-//!
-//! \brief The TensorRT UFF parser API namespace.
-//!
-namespace nvuffparser
-{
-
-//!
-//! \enum UffInputOrder
-//! \brief The different possible supported input order.
-//!
-enum class UffInputOrder : int32_t
-{
- kNCHW = 0, //!< NCHW order.
- kNHWC = 1, //!< NHWC order.
- kNC = 2 //!< NC order.
-};
-
-//!
-//! \enum FieldType
-//! \brief The possible field types for custom layer.
-//!
-
-enum class FieldType : int32_t
-{
- kFLOAT = 0, //!< FP32 field type.
- kINT32 = 1, //!< INT32 field type.
- kCHAR = 2, //!< char field type. String for length>1.
- kDIMS = 4, //!< nvinfer1::Dims field type.
- kDATATYPE = 5, //!< nvinfer1::DataType field type.
- kUNKNOWN = 6
-};
-
-//!
-//! \class FieldMap
-//!
-//! \brief An array of field params used as a layer parameter for plugin layers.
-//!
-//! The node fields are passed by the parser to the API through the plugin
-//! constructor. The implementation of the plugin should parse the contents of
-//! the fieldMap as part of the plugin constructor
-//!
-class TENSORRTAPI FieldMap
-{
-public:
- char const* name{};
- void const* data{};
- FieldType type{FieldType::kUNKNOWN};
- int32_t length{1};
-
- //! \deprecated Legacy constructor, retained for ABI compatibility. Deprecated in TensorRT 8.6.
- //! Use the default constructor instead.
- TRT_DEPRECATED FieldMap(char const* name, void const* data, FieldType const type, int32_t length = 1);
-
- //! Default constructor
- FieldMap() = default;
-};
-
-struct FieldCollection
-{
- int32_t nbFields;
- FieldMap const* fields;
-};
-
-//!
-//! \class IUffParser
-//!
-//! \brief Class used for parsing models described using the UFF format.
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-class IUffParser
-{
-public:
- //!
- //! \brief Register an input name of a UFF network with the associated Dimensions.
- //!
- //! \param inputName Input name.
- //! \param inputDims Input dimensions.
- //! \param inputOrder Input order on which the framework input was originally.
- //!
- virtual bool registerInput(char const* inputName, nvinfer1::Dims inputDims, UffInputOrder inputOrder) noexcept = 0;
-
- //!
- //! \brief Register an output name of a UFF network.
- //!
- //! \param outputName Output name.
- //!
- virtual bool registerOutput(char const* outputName) noexcept = 0;
-
- //!
- //! \brief Parse a UFF file.
- //!
- //! \param file File name of the UFF file.
- //! \param network Network in which the UFFParser will fill the layers.
- //! \param weightsType The type on which the weights will transformed in.
- //!
- virtual bool parse(char const* file, nvinfer1::INetworkDefinition& network,
- nvinfer1::DataType weightsType = nvinfer1::DataType::kFLOAT) noexcept = 0;
-
- //!
- //! \brief Parse a UFF buffer, useful if the file already live in memory.
- //!
- //! \param buffer Buffer of the UFF file.
- //! \param size Size of buffer of the UFF file.
- //! \param network Network in which the UFFParser will fill the layers.
- //! \param weightsType The type on which the weights will transformed in.
- //!
- virtual bool parseBuffer(char const* buffer, std::size_t size, nvinfer1::INetworkDefinition& network,
- nvinfer1::DataType weightsType = nvinfer1::DataType::kFLOAT) noexcept = 0;
-
- //!
- //! \deprecated Use `delete` instead. Deprecated in TRT 8.0.
- //!
- TRT_DEPRECATED virtual void destroy() noexcept = 0;
-
- //!
- //! \brief Return Version Major of the UFF.
- //!
- virtual int32_t getUffRequiredVersionMajor() noexcept = 0;
-
- //!
- //! \brief Return Version Minor of the UFF.
- //!
- virtual int32_t getUffRequiredVersionMinor() noexcept = 0;
-
- //!
- //! \brief Return Patch Version of the UFF.
- //!
- virtual int32_t getUffRequiredVersionPatch() noexcept = 0;
-
- //!
- //! \brief Set the namespace used to lookup and create plugins in the network.
- //!
- virtual void setPluginNamespace(char const* libNamespace) noexcept = 0;
-
- virtual ~IUffParser() noexcept = default;
-
-public:
- //!
- //! \brief Set the ErrorRecorder for this interface
- //!
- //! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
- //!
- //! If an error recorder is not set, messages will be sent to the global log stream.
- //!
- //! \param recorder The error recorder to register with this interface.
- //
- //! \see getErrorRecorder()
- //!
- virtual void setErrorRecorder(nvinfer1::IErrorRecorder* recorder) noexcept = 0;
-
- //!
- //! \brief get the ErrorRecorder assigned to this interface.
- //!
- //! Retrieves the assigned error recorder object for the given class. A
- //! nullptr will be returned if setErrorRecorder has not been called.
- //!
- //! \return A pointer to the IErrorRecorder object that has been registered.
- //!
- //! \see setErrorRecorder()
- //!
- virtual nvinfer1::IErrorRecorder* getErrorRecorder() const noexcept = 0;
-};
-
-//!
-//! \brief Creates a IUffParser object.
-//!
-//! \return A pointer to the IUffParser object is returned.
-//!
-//! \see nvuffparser::IUffParser
-//!
-//! \deprecated IUffParser will be removed in TensorRT 9.0. Plan to migrate your workflow to
-//! use nvonnxparser::IParser for deployment.
-//!
-TENSORRTAPI IUffParser* createUffParser() noexcept;
-
-//!
-//! \brief Shuts down protocol buffers library.
-//!
-//! \note No part of the protocol buffers library can be used after this function is called.
-//!
-TENSORRTAPI void shutdownProtobufLibrary(void) noexcept;
-
-} // namespace nvuffparser
-
-//!
-//! Internal C entry point for creating IUffParser
-//! @private
-//!
-extern "C" TENSORRTAPI void* createNvUffParser_INTERNAL() noexcept;
-
-#endif /* !NV_UFF_PARSER_H */
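The removed UFF parser's own deprecation note points users at nvonnxparser::IParser. A rough migration sketch, assuming an existing nvinfer1::INetworkDefinition and ILogger (all names are placeholders):

    #include "NvInfer.h"
    #include "NvOnnxParser.h"
    #include <memory>

    // Rough sketch of the ONNX-parser path recommended by the removed header's deprecation note.
    void parseOnnxModel(nvinfer1::INetworkDefinition& network, nvinfer1::ILogger& logger)
    {
        std::unique_ptr<nvonnxparser::IParser> parser{nvonnxparser::createParser(network, logger)};
        if (!parser->parseFromFile("model.onnx", static_cast<int32_t>(nvinfer1::ILogger::Severity::kWARNING)))
        {
            // inspect parser->getNbErrors() / parser->getError(i) to diagnose failures
        }
    }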
diff --git a/include/NvUtils.h b/include/NvUtils.h
deleted file mode 100644
index be879031..00000000
--- a/include/NvUtils.h
+++ /dev/null
@@ -1,151 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef NV_UTILS_H
-#define NV_UTILS_H
-
-#include "NvInfer.h"
-
-//!
-//! \file NvUtils.h
-//!
-//! This file includes various utility functions
-//!
-
-namespace nvinfer1
-{
-namespace utils
-{
-
-//!
-//! \param input The input weights to reshape.
-//! \param shape The shape of the weights.
-//! \param shapeOrder The order of the dimensions to process for the output.
-//! \param data The location where the output data is placed.
-//! \param nbDims The number of dimensions to process.
-//!
-//! \brief Reformat the input weights of the given shape based on the new
-//! order of dimensions.
-//!
-//! Take the weights specified by \p input with the dimensions specified by
-//! \p shape and re-order the weights based on the new dimensions specified
-//! by \p shapeOrder. The size of each dimension and the input data is not
-//! modified. The output volume pointed to by \p data must be the same as
-//! he \p input volume.
-//!
-//! Example usage:
-//! float *out = new float[N*C*H*W];
-//! Weights input{DataType::kFLOAT, {0 ... N*C*H*W-1}, N*C*H*W size};
-//! int32_t order[4]{1, 0, 3, 2};
-//! int32_t shape[4]{C, N, W, H};
-//! reshapeWeights(input, shape, order, out, 4);
-//! Weights reshaped{input.type, out, input.count};
-//!
-//! Input Matrix{3, 2, 3, 2}:
-//! { 0 1}, { 2 3}, { 4 5} <-- {0, 0, *, *}
-//! { 6 7}, { 8 9}, {10 11} <-- {0, 1, *, *}
-//! {12 13}, {14 15}, {16 17} <-- {1, 0, *, *}
-//! {18 19}, {20 21}, {22 23} <-- {1, 1, *, *}
-//! {24 25}, {26 27}, {28 29} <-- {2, 0, *, *}
-//! {30 31}, {32 33}, {34 35} <-- {2, 1, *, *}
-//!
-//! Output Matrix{2, 3, 2, 3}:
-//! { 0 2 4}, { 1 3 5} <-- {0, 0, *, *}
-//! {12 14 16}, {13 15 17} <-- {0, 1, *, *}
-//! {24 26 28}, {25 27 29} <-- {0, 2, *, *}
-//! { 6 8 10}, { 7 9 11} <-- {1, 0, *, *}
-//! {18 20 22}, {19 21 23} <-- {1, 1, *, *}
-//! {30 32 34}, {31 33 35} <-- {1, 2, *, *}
-//!
-//! \return True on success, false on failure.
-//!
-//! \deprecated Deprecated in TensorRT 8.0.
-//!
-//! \warning This file will be removed in TensorRT 10.0.
-//!
-TRT_DEPRECATED TENSORRTAPI bool reshapeWeights(
- Weights const& input, int32_t const* shape, int32_t const* shapeOrder, void* data, int32_t nbDims) noexcept;
-
-//!
-//! \param input The input data to re-order.
-//! \param order The new order of the data sub-buffers.
-//! \param num The number of data sub-buffers to re-order.
-//! \param size The size of each data sub-buffer in bytes.
-//!
-//! \brief Takes an input stream and re-orders \p num chunks of the data
-//! given the \p size and \p order.
-//!
-//! In some frameworks, the ordering of the sub-buffers within a dimension
-//! is different than the way that TensorRT expects them.
-//! TensorRT expects the gate/bias sub-buffers for LSTM's to be in fico order.
-//! TensorFlow however formats the sub-buffers in icfo order.
-//! This helper function solves this in a generic fashion.
-//!
-//! Example usage output of reshapeWeights above:
-//! int32_t indir[1]{1, 0}
-//! int32_t stride = W*H;
-//! for (int32_t x = 0, y = N*C; x < y; ++x)
-//! reorderSubBuffers(out + x * stride, indir, H, W);
-//!
-//! Input Matrix{2, 3, 2, 3}:
-//! { 0 2 4}, { 1 3 5} <-- {0, 0, *, *}
-//! {12 14 16}, {13 15 17} <-- {0, 1, *, *}
-//! {24 26 28}, {25 27 29} <-- {0, 2, *, *}
-//! { 6 8 10}, { 7 9 11} <-- {1, 0, *, *}
-//! {18 20 22}, {19 21 23} <-- {1, 1, *, *}
-//! {30 32 34}, {31 33 35} <-- {1, 2, *, *}
-//!
-//! Output Matrix{2, 3, 2, 3}:
-//! { 1 3 5}, { 0 2 4} <-- {0, 0, *, *}
-//! {13 15 17}, {12 14 16} <-- {0, 1, *, *}
-//! {25 27 29}, {24 26 28} <-- {0, 2, *, *}
-//! { 7 9 11}, { 6 8 10} <-- {1, 0, *, *}
-//! {19 21 23}, {18 20 22} <-- {1, 1, *, *}
-//! {31 33 35}, {30 32 34} <-- {1, 2, *, *}
-//!
-//! \return True on success, false on failure.
-//!
-//! \see reshapeWeights()
-//!
-//! \deprecated Deprecated in TensorRT 8.0.
-//!
-//! \warning This file will be removed in TensorRT 10.0.
-//!
-TRT_DEPRECATED TENSORRTAPI bool reorderSubBuffers(
- void* input, int32_t const* order, int32_t num, int32_t size) noexcept;
-
-//!
-//! \param input The input data to transpose.
-//! \param type The type of the data to transpose.
-//! \param num The number of data sub-buffers to transpose.
-//! \param height The size of the height dimension to transpose.
-//! \param width The size of the width dimension to transpose.
-//!
-//! \brief Transpose \p num sub-buffers of \p height * \p width.
-//!
-//! \return True on success, false on failure.
-//!
-//! \deprecated Deprecated in TensorRT 8.0.
-//!
-//! \warning This file will be removed in TensorRT 10.0.
-//!
-TRT_DEPRECATED TENSORRTAPI bool transposeSubBuffers(
- void* input, DataType type, int32_t num, int32_t height, int32_t width) noexcept;
-
-} // namespace utils
-} // namespace nvinfer1
-#endif // NV_UTILS_H
diff --git a/parsers/CMakeLists.txt b/parsers/CMakeLists.txt
index 5dab1c9f..750942e6 100644
--- a/parsers/CMakeLists.txt
+++ b/parsers/CMakeLists.txt
@@ -15,12 +15,9 @@
# limitations under the License.
#
+############################# GENERATE C++ PROTO FILES ###################################
add_custom_target(parsers DEPENDS
- nvcaffeparserlibs
- nvonnxparser
-)
-
-add_subdirectory(caffe)
+ nvonnxparser)
add_definitions("-D_PROTOBUF_INSTALL_DIR=${Protobuf_INSTALL_DIR}")
add_compile_options("-Dgoogle=google_private")
diff --git a/parsers/caffe/CMakeLists.txt b/parsers/caffe/CMakeLists.txt
deleted file mode 100644
index f6abda79..00000000
--- a/parsers/caffe/CMakeLists.txt
+++ /dev/null
@@ -1,144 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-############################# GENERATE C++ PROTO FILES ###################################
-protobuf_generate_cpp(CAFFE_PROTO_SRC CAFFE_PROTO_HDR proto/trtcaffe.proto)
-add_custom_target(caffe_proto
- DEPENDS
- ${CAFFE_PROTO_SRC} ${CAFFE_PROTO_HDR}
-)
-############################## BUILD CAFFE PARSER ########################################
-add_custom_target(nvcaffeparserlibs)
-
-set(TARGET_NAME nvcaffeparser)
-set(SHARED_TARGET ${TARGET_NAME})
-set(STATIC_TARGET ${TARGET_NAME}_static)
-
-################################# DEFINE SOURCES ########################################
-include(CaffeParserSources.txt)
-#########################################################################################
-
-################################## SHARED LIBRARY #######################################
-
-add_library(${SHARED_TARGET} SHARED
- ${CAFFE_PARSER_SRCS}
-)
-
-add_dependencies(${SHARED_TARGET} caffe_proto)
-
-target_include_directories(${SHARED_TARGET}
- PUBLIC ${PROJECT_SOURCE_DIR}/include
- PRIVATE .
- PRIVATE caffeParser
- PRIVATE caffeParser/opParsers
- PRIVATE caffeWeightFactory
- PRIVATE ../common
- PRIVATE ${Protobuf_INCLUDE_DIR}
- PRIVATE ${CMAKE_CURRENT_BINARY_DIR}/proto
-)
-
-set_target_properties(${SHARED_TARGET}
- PROPERTIES
- CXX_STANDARD 11
- CXX_STANDARD_REQUIRED YES
- CXX_EXTENSIONS NO
- ARCHIVE_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
- LIBRARY_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
- RUNTIME_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
-)
-
-target_link_libraries(${SHARED_TARGET}
- ${Protobuf_LIBRARY}
- nvinfer
-)
-
-# modify google namespace to avoid namespace collision.
-set(GOOGLE google_private)
-target_compile_definitions(${SHARED_TARGET}
- PRIVATE
- "-Dgoogle=${GOOGLE}"
- "-DGOOGLE_PROTOBUF_ARCH_64_BIT"
-)
-
-set_target_properties(${SHARED_TARGET} PROPERTIES LINK_FLAGS "-Wl,--exclude-libs,ALL")
-
-set_target_properties(${SHARED_TARGET} PROPERTIES DEBUG_POSTFIX ${TRT_DEBUG_POSTFIX})
-
-set_target_properties(${SHARED_TARGET} PROPERTIES VERSION ${TRT_VERSION} SOVERSION ${TRT_SOVERSION} )
-
-set_property(TARGET ${SHARED_TARGET} PROPERTY CUDA_STANDARD 11)
-
-################################## STATIC LIBRARY #######################################
-
-add_library(${STATIC_TARGET} STATIC
- ${CAFFE_PARSER_SRCS}
-)
-
-add_dependencies(${STATIC_TARGET} caffe_proto)
-
-target_include_directories(${STATIC_TARGET}
- PUBLIC ${PROJECT_SOURCE_DIR}/include
- PRIVATE .
- PRIVATE caffeParser
- PRIVATE caffeParser/opParsers
- PRIVATE caffeWeightFactory
- PRIVATE ../common
- PRIVATE ${Protobuf_INCLUDE_DIR}
- PRIVATE ${CMAKE_CURRENT_BINARY_DIR}/proto
-)
-
-set_target_properties(${STATIC_TARGET}
- PROPERTIES
- CXX_STANDARD 11
- CXX_STANDARD_REQUIRED YES
- CXX_EXTENSIONS NO
- ARCHIVE_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
- LIBRARY_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
- RUNTIME_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
-)
-
-target_link_libraries(${STATIC_TARGET}
- ${Protobuf_LIBRARY}
-)
-
-# modify google namespace to avoid namespace collision.
-set(GOOGLE google_private)
-target_compile_definitions(${STATIC_TARGET}
- PRIVATE
- "-Dgoogle=${GOOGLE}"
- "-DGOOGLE_PROTOBUF_ARCH_64_BIT"
-)
-
-set_target_properties(${STATIC_TARGET} PROPERTIES LINK_FLAGS "-Wl,--exclude-libs,ALL")
-
-set_target_properties(${STATIC_TARGET} PROPERTIES DEBUG_POSTFIX ${TRT_DEBUG_POSTFIX})
-
-set_target_properties(${STATIC_TARGET} PROPERTIES VERSION ${TRT_VERSION} SOVERSION ${TRT_SOVERSION} )
-
-set_property(TARGET ${STATIC_TARGET} PROPERTY CUDA_STANDARD 11)
-
-#########################################################################################
-
-add_dependencies(nvcaffeparserlibs ${SHARED_TARGET} ${STATIC_TARGET})
-
-################################### INSTALLATION ########################################
-
-install(TARGETS ${TARGET_NAME}
- RUNTIME DESTINATION bin
- LIBRARY DESTINATION lib
- ARCHIVE DESTINATION lib
-)
diff --git a/parsers/caffe/CaffeParserSources.txt b/parsers/caffe/CaffeParserSources.txt
deleted file mode 100644
index b7f69743..00000000
--- a/parsers/caffe/CaffeParserSources.txt
+++ /dev/null
@@ -1,46 +0,0 @@
-#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-set(CAFFE_PARSER_SRCS
- ${CAFFE_PROTO_SRC}
- caffeParser/opParsers/opParsers.h
- caffeParser/opParsers/parseAbsVal.cpp
- caffeParser/opParsers/parseBatchNorm.cpp
- caffeParser/opParsers/parseBNLL.cpp
- caffeParser/opParsers/parseClip.cpp
- caffeParser/opParsers/parseConcat.cpp
- caffeParser/opParsers/parseConv.cpp
- caffeParser/opParsers/parseCrop.cpp
- caffeParser/opParsers/parseDeconv.cpp
- caffeParser/opParsers/parseEltwise.cpp
- caffeParser/opParsers/parseELU.cpp
- caffeParser/opParsers/parseInnerProduct.cpp
- caffeParser/opParsers/parseLRN.cpp
- caffeParser/opParsers/parsePermute.cpp
- caffeParser/opParsers/parsePooling.cpp
- caffeParser/opParsers/parsePower.cpp
- caffeParser/opParsers/parsePReLU.cpp
- caffeParser/opParsers/parseReduction.cpp
- caffeParser/opParsers/parseReLU.cpp
- caffeParser/opParsers/parseReshape.cpp
- caffeParser/opParsers/parseScale.cpp
- caffeParser/opParsers/parseSigmoid.cpp
- caffeParser/opParsers/parseSoftMax.cpp
- caffeParser/opParsers/parseTanH.cpp
- caffeWeightFactory/caffeWeightFactory.cpp
- caffeParser/caffeParser.cpp
- NvCaffeParser.cpp
-)
diff --git a/parsers/caffe/binaryProtoBlob.h b/parsers/caffe/binaryProtoBlob.h
deleted file mode 100644
index 79ec2976..00000000
--- a/parsers/caffe/binaryProtoBlob.h
+++ /dev/null
@@ -1,67 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef TRT_CAFFE_PARSER_BINARY_PROTO_BLOB_H
-#define TRT_CAFFE_PARSER_BINARY_PROTO_BLOB_H
-#include
-
-#include "NvCaffeParser.h"
-#include "NvInfer.h"
-
-namespace nvcaffeparser1
-{
-class BinaryProtoBlob : public IBinaryProtoBlob
-{
-public:
- BinaryProtoBlob(void* memory, nvinfer1::DataType type, nvinfer1::Dims4 dimensions)
- : mMemory(memory)
- , mDataType(type)
- , mDimensions(dimensions)
- {
- }
-
- nvinfer1::Dims4 getDimensions() noexcept override
- {
- return mDimensions;
- }
-
- nvinfer1::DataType getDataType() noexcept override
- {
- return mDataType;
- }
-
- const void* getData() noexcept override
- {
- return mMemory;
- }
-
- void destroy() noexcept override
- {
- delete this;
- }
-
- ~BinaryProtoBlob() noexcept override
- {
- free(mMemory);
- }
-
- void* mMemory;
- nvinfer1::DataType mDataType;
- nvinfer1::Dims4 mDimensions;
-};
-} // namespace nvcaffeparser1
-#endif // TRT_CAFFE_PARSER_BINARY_PROTO_BLOB_H
diff --git a/parsers/caffe/blobNameToTensor.h b/parsers/caffe/blobNameToTensor.h
deleted file mode 100644
index d685cced..00000000
--- a/parsers/caffe/blobNameToTensor.h
+++ /dev/null
@@ -1,72 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef TRT_CAFFE_PARSER_BLOB_NAME_TO_TENSOR_H
-#define TRT_CAFFE_PARSER_BLOB_NAME_TO_TENSOR_H
-
-#include